If we query every time we receive data, we will kill the API. However, if
we query only after Spot has produced results, we add context to the
suspicious results. Can we explore what happens if we store the "common"
results and query only for things that are out of range? How much
information we need to store is the other side of the question.
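
To make the idea concrete, here is a minimal sketch of a bounded local
store that only goes to the remote service on a miss. Everything in it is
an assumption for illustration: `EnrichmentCache`, `fake_fetch`, the record
shape, and the size limit are hypothetical, and `fake_fetch` only stands in
for a real Censys query (it does not call their API).

```python
from collections import OrderedDict
from typing import Callable, Dict


class EnrichmentCache:
    """Bounded LRU store placed in front of a remote lookup (hypothetical)."""

    def __init__(self, fetch: Callable[[str], Dict], max_entries: int = 100_000):
        self._fetch = fetch              # stand-in for the remote query
        self._max = max_entries          # caps how much we keep locally
        self._store: "OrderedDict[str, Dict]" = OrderedDict()
        self.remote_calls = 0            # how often we actually hit the API

    def lookup(self, ip: str) -> Dict:
        if ip in self._store:
            # "Common" result: serve it locally and mark it recently used.
            self._store.move_to_end(ip)
            return self._store[ip]
        # Not stored: this is the only case that reaches the remote service.
        self.remote_calls += 1
        record = self._fetch(ip)
        self._store[ip] = record
        if len(self._store) > self._max:
            # Evict the least recently used entry to bound storage.
            self._store.popitem(last=False)
        return record


def fake_fetch(ip: str) -> Dict:
    """Placeholder for a remote query; returns a made-up record shape."""
    return {"ip": ip, "services": []}
```

The `max_entries` knob is where the "how much do we store" question shows
up: a larger store means fewer remote calls but more memory and staler data.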
I agree with Vartika: if we can enrich the data downstream, we will add
value to the solution.
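
As a rough illustration of the map-side lookup and join mentioned in the
quoted message, here is a plain-Python sketch that joins one micro-batch
against a snapshot held in memory. It is not Spark code and not Spot's
implementation; `enrich_batch`, the `dst_ip` field, and the snapshot layout
are all hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, List, Optional


def enrich_batch(
    batch: Iterable[Dict],
    snapshot: Dict[str, Dict],
) -> List[Dict]:
    """Attach snapshot data to each record by dictionary lookup.

    `snapshot` plays the role of the small side of the join: it is held in
    memory so each record is enriched without shuffling or remote calls.
    """
    enriched = []
    for record in batch:
        out = dict(record)  # copy so the input batch is left untouched
        # Hypothetical field name; None means the snapshot has no entry.
        match: Optional[Dict] = snapshot.get(record["dst_ip"])
        out["enrichment"] = match
        enriched.append(out)
    return enriched


# Made-up example data, for illustration only.
snapshot = {"203.0.113.7": {"asn": 64500, "services": ["https"]}}
batch = [{"dst_ip": "203.0.113.7"}, {"dst_ip": "198.51.100.9"}]
result = enrich_batch(batch, snapshot)
```

Records with no match keep `enrichment` set to `None`, which would be the
natural candidates for a later, rate-limited remote query.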
Regards

2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:

> This looks interesting. I understand we can either query the database
> directly, or download point-in-time snapshots at a specified, frequent
> interval.
>
> Ideally the enrichment should be done in a Streaming job based on the
> snapshot downloaded.
>
> (I am not sure we would want to query the REST API on the public internet
> from within a Spot flow. Alternatively, we could query the downloaded
> snapshot through a REST API from the executors, but that may require some
> additional tuning. That's theoretical at this point.)
>
> Complexity would be defined by the size of the snapshot data downloaded
> as well as the number of external IP addresses flowing in the micro-batch.
>
> I have seen such enrichment done successfully in the past at large scale,
> for both the enrichment data and the IP addresses, on a 50-node cluster
> with a batch interval of about 4 seconds. The IP addresses were on the
> order of 400K and the enrichment data was on the order of 400K. It
> involved a map-side lookup and join, then sending the enriched data
> further downstream to a Kafka topic.
>
> Thoughts?
>
> On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]> wrote:
>
> > Folks,
> >
> > I'd like to discuss the possibility of incorporating Censys for profiling
> > and context enrichment of external IPv4 addresses on Spot. This community
> > approach leverages ZMap and ZGrab to scan the Internet on a recurrent
> > basis and is building a complex map of running services, ASN state, and
> > SSL changes.
> >
> > High level description of the project below:
> >
> > "Censys is a search engine that allows computer scientists to ask
> > questions about the devices and networks that compose the Internet.
> > Driven by Internet-wide scanning, Censys lets researchers find specific
> > hosts and create aggregate reports on how devices, websites, and
> > certificates are configured and deployed."
> >
> > They have a REST API for querying at volume, so it could be
> > incorporated through some type of extension/plugin manager that can
> > enable or disable it on demand.
> >
> > A deep dive on the research can be found here:
> >
> > https://www.censys.io/static/censys.pdf
> >
> > Let's open the discussion and determine next steps from here.
> >
> >
> > Thanks,
> >
> > Cesar
> >
>
>
>
> --
> Vartika Singh
> Senior Solutions Architect
> Cloudera
>
