+1 on the download. I think this might fit nicely into some ideas for streaming ingest. For instance, data can be ingested via a Spark Streaming worker, normalized (talk of schemas aside), and sent to another streaming worker, perhaps a Structured Streaming worker. A Censys DataFrame can be loaded with the latest download and used to check incoming records against some conditions. Results can be published to a streaming-specific table, or downstream to another streaming job waiting only for suspicious events/bad actors.
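(A minimal, Spark-free sketch of the condition-check step described above; in a real pipeline the lookup would be a DataFrame loaded from the latest Censys snapshot inside a Structured Streaming job. All field and service names here are hypothetical.)

```python
# Hypothetical sketch: check a micro-batch of flow records against a
# Censys-derived lookup keyed by IP, keeping only records whose
# destination exposes a "risky" service.

def flag_suspicious(flow_records, censys_by_ip, risky_services=("telnet",)):
    """Return only the records whose destination IP exposes a risky service."""
    suspicious = []
    for record in flow_records:
        profile = censys_by_ip.get(record["dst_ip"])
        if profile and set(profile["services"]) & set(risky_services):
            # Attach the Censys context so downstream jobs see why it matched.
            suspicious.append({**record, "censys_services": profile["services"]})
    return suspicious

# Toy stand-in for the loaded snapshot (documentation-range IPs).
censys_by_ip = {
    "198.51.100.7": {"services": ["telnet", "http"]},
    "203.0.113.9": {"services": ["https"]},
}
batch = [
    {"src_ip": "10.0.0.1", "dst_ip": "198.51.100.7"},
    {"src_ip": "10.0.0.2", "dst_ip": "203.0.113.9"},
]
hits = flag_suspicious(batch, censys_by_ip)
```

The same predicate would translate directly to a DataFrame filter after joining the stream against the snapshot.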
The streaming table could take some thought, or perhaps it's just a directory of Parquet data that can be loaded and queried; the latter could be troublesome with large implementations. Another option would be pushing that data to Solr/Grafana/etc. The UI should have a dashboard making calls to the GraphQL API; the question is how to structure a streaming time-series dashboard, or perhaps integrating third-party tools with Spot UI is the best option.

- Nathanael

> On Jun 28, 2017, at 10:09 AM, Mark Grover <[email protected]> wrote:
>
> Thanks for starting this thread, Cesar.
> I don't know enough about Censys to have a strong opinion on this.
>
> I just looked around from a licensing and workflow perspective. Their Python
> client <https://github.com/censys/censys-python> seems to be ASLv2
> licensed, so that's a good thing. I think download makes sense as well, but
> do you have any thoughts on the workflow there? Currently the download
> seems to require a user account, so does that mean every update of the
> downloaded data from Censys would be manual? Do you know if that can be
> automated, with only ASLv2 or compatibly licensed tooling?
>
> On Wed, Jun 28, 2017 at 9:49 AM, Michael Ridley <[email protected]> wrote:
>
>> Agree with others that we should work from a download rather than hitting
>> their API. I haven't had a chance to download the data to look at it yet,
>> but assuming it's a reasonable size, that seems like a better way to go.
>> I'm not sure that we would need any REST API in the executors - why not
>> just load it into HDFS and read it into a DataFrame? What kind of
>> enrichment did we have in mind? Having looked only at the web site, without
>> looking at the data download, it seems more of a reference data set that
>> could provide additional information on external IP addresses in netflow
>> data, but I don't know that the ODM tables would need to be enriched.
>> Couldn't we just do a JOIN when we query to display in the UI?
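(The JOIN-at-query-time idea above can be sketched with an in-process SQLite database standing in for the Hive/Impala tables Spot actually uses; table and column names are made up for illustration.)

```python
import sqlite3

# In Spot this would be a JOIN between the netflow ODM table and a
# Censys reference table in Hive/Impala; sqlite3 stands in here so the
# sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flow (src_ip TEXT, dst_ip TEXT, bytes INTEGER);
    CREATE TABLE censys_ref (ip TEXT PRIMARY KEY, asn INTEGER, services TEXT);
    INSERT INTO flow VALUES ('10.0.0.1', '198.51.100.7', 1200);
    INSERT INTO censys_ref VALUES ('198.51.100.7', 64496, 'telnet,http');
""")

# LEFT JOIN keeps every flow row even when Censys has no profile for the IP,
# so the ODM table itself never needs to be rewritten.
rows = conn.execute("""
    SELECT f.src_ip, f.dst_ip, f.bytes, c.asn, c.services
    FROM flow f LEFT JOIN censys_ref c ON f.dst_ip = c.ip
""").fetchall()
```

The point is that the enrichment lives entirely in the query, so the reference data can be refreshed on its own schedule.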
>>
>> I like the idea of Spot having a set of ODM schemas and a set of supported
>> reference data schemas, of which perhaps this could be the first.
>>
>> Michael
>>
>> On Tue, Jun 27, 2017 at 9:31 PM, [email protected] <[email protected]> wrote:
>>
>>> If we query every time we receive data, we will kill the API; however,
>>> if we do it after Spot has results, we are adding context to the
>>> suspicious results. Can we explore what happens if we store the "common"
>>> results and only query things that are out of range? How much
>>> information we need to store is the other side of the question.
>>> Agree with Vartika: if we can enrich the data downstream we will add
>>> value to the solution.
>>> Regards
>>>
>>> 2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:
>>>
>>>> This looks interesting. I understand we can either directly query the
>>>> database, or download point-in-time snapshots at a specified
>>>> interval.
>>>>
>>>> Ideally the enrichment should be done in a streaming job based on the
>>>> snapshot downloaded.
>>>>
>>>> (I am not sure that from within a Spot flow we would want to query the
>>>> REST API available on the public internet. Alternatively, we could query
>>>> the downloaded snapshot using a REST API from the executors, but that
>>>> may require some additional tuning. That's theoretical at this point.)
>>>>
>>>> Complexity would be defined by the size of the data snapshot downloaded,
>>>> as well as the external IP addresses flowing in the micro-batch.
>>>>
>>>> I have seen such enrichment succeed in the past at large scale,
>>>> including IP addresses, on a 50-node cluster with about a 4-second
>>>> batch interval. The IP addresses were on the order of 400K, and
>>>> the enrichment data was on the order of 400K. It involved a map-side
>>>> lookup and join, then sending the enriched data further downstream to
>>>> a Kafka topic.
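(A sketch of the map-side lookup-and-join Vartika describes: the enrichment table is shipped to every executor and each micro-batch record is enriched with a local lookup, avoiding a shuffle. A plain dict stands in for a Spark broadcast variable here; the field names are hypothetical.)

```python
# Map-side enrichment sketch: broadcast_table plays the role of a Spark
# broadcast variable (~400K entries in the case described above); each
# record in the micro-batch is enriched via a local dict lookup.

def enrich_batch(batch, broadcast_table):
    """Yield each record merged with any Censys context found for its dst IP."""
    for record in batch:
        context = broadcast_table.get(record["dst_ip"], {})
        yield {**record, **context}

broadcast_table = {"198.51.100.7": {"asn": 64496, "country": "US"}}
batch = [
    {"dst_ip": "198.51.100.7", "bytes": 512},
    {"dst_ip": "192.0.2.4", "bytes": 64},  # no Censys profile: passes through
]
enriched = list(enrich_batch(batch, broadcast_table))
```

In the Spark version this per-record loop is what runs inside `mapPartitions` before the enriched stream is produced to the downstream Kafka topic.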
>>>>
>>>> Thoughts?
>>>>
>>>> On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]> wrote:
>>>>
>>>>> Folks,
>>>>>
>>>>> I'd like to discuss the possibility of incorporating Censys for profiling
>>>>> and context enrichment of external IPv4 addresses in Spot. This community
>>>>> project leverages ZMap and ZGrab to scan the Internet on a recurrent
>>>>> basis, and is building a complex map of running services, ASN state,
>>>>> and SSL changes.
>>>>>
>>>>> High-level description of the project below:
>>>>>
>>>>> "Censys is a search engine that allows computer scientists to ask
>>>>> questions about the devices and networks that compose the Internet.
>>>>> Driven by Internet-wide scanning, Censys lets researchers find specific
>>>>> hosts and create aggregate reports on how devices, websites, and
>>>>> certificates are configured and deployed."
>>>>>
>>>>> They have a REST API for queries at volume, so it could be incorporated
>>>>> through some type of extension/plugin manager that can enable/disable
>>>>> it on demand.
>>>>>
>>>>> A deep dive on the research can be found here:
>>>>>
>>>>> https://www.censys.io/static/censys.pdf
>>>>>
>>>>> Let's get the discussion open and determine next steps from here.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Cesar
>>>>
>>>> --
>>>> Vartika Singh
>>>> Senior Solutions Architect
>>>> Cloudera
>>
>> --
>> Michael Ridley <[email protected]>
>> office: (650) 352-1337
>> mobile: (571) 438-2420
>> Senior Solutions Architect
>> Cloudera, Inc.
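(On Mark's question about automating the download: the Censys account yields an API ID/secret pair, so a scheduled job could authenticate non-interactively. The sketch below only builds the request; the `/data/{series}` endpoint path and HTTP Basic auth scheme follow my recollection of Censys' v1 data API and should be treated as assumptions, not confirmed details.)

```python
import os

# Hypothetical sketch of a non-interactive snapshot fetch: credentials come
# from the environment, so no manual login is needed. Endpoint path is an
# assumption based on Censys' v1 data API.
API_BASE = "https://censys.io/api/v1"

def build_download_request(series, api_id, api_secret):
    """Return (url, auth) for an HTTP GET listing results for a data series."""
    url = f"{API_BASE}/data/{series}"
    return url, (api_id, api_secret)

url, auth = build_download_request(
    "ipv4",
    os.environ.get("CENSYS_API_ID", "example-id"),
    os.environ.get("CENSYS_API_SECRET", "example-secret"),
)
# A cron-driven job would then do requests.get(url, auth=auth), read the
# latest snapshot location from the JSON response, and land the file in HDFS.
```

Whether this keeps everything ASLv2-compatible depends only on the HTTP client used (e.g. `requests` or the ASLv2-licensed censys-python client Mark mentions).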
