Re: Nutch Extension for realtime processing

Julien Nioche Thu, 19 Jun 2014 01:40:24 -0700

Hi Jake,

Thanks for taking the time to explain what you want to do with the
Disseminator. It does make sense, but to be valuable it would have to be
quite generic so that people can put it to different uses.


For instance, one thing I did for a customer was to hack the Fetcher so that
it sends statistics to a monitoring tool, e.g. statsd / Graphite / Librato.
This way I can see the evolution of metrics during the fetching (active
threads, number of queues, number of URLs in queues, URLs fetched, IO,
etc.). We have these figures in the task UI but it helps to see how they
evolve over time. The fetching step is the only place where it makes sense
to have this as the other steps are quite linear in the way they work.
Anyway, would that fit with the Disseminator? Not sure, we could have yet
another plugin to do that.
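
As a rough illustration of the statsd idea above (a minimal sketch, not the
code used for that customer): statsd accepts plain-text lines of the form
name:value|g for gauges, sent over UDP, usually on port 8125. The class
name, metric names, host and values below are all made up for the example.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class FetcherStatsSketch {

    // Formats a gauge in the statsd line protocol: "<name>:<value>|g".
    static String gauge(String name, long value) {
        return name + ":" + value + "|g";
    }

    // Fire-and-forget UDP send. Any error is swallowed on purpose so that a
    // monitoring outage can never slow down or fail the fetch itself.
    static void send(String host, int port, String line) {
        try (DatagramSocket socket = new DatagramSocket()) {
            byte[] payload = line.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        } catch (Exception e) {
            // deliberately ignored: metrics are best effort
        }
    }

    public static void main(String[] args) {
        // Hypothetical metric names and values; a real Fetcher would report
        // its own counters here at a regular interval.
        send("localhost", 8125, gauge("nutch.fetcher.activeThreads", 42));
        send("localhost", 8125, gauge("nutch.fetcher.queues", 120));
        send("localhost", 8125, gauge("nutch.fetcher.urlsInQueues", 3500));
    }
}
```

UDP is a natural fit here because the sender never waits for the monitoring
side to answer.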

The logging use case with Kibana is a good example: I did something
similar with the StormCrawler (its logging mechanisms being a lot
worse than what Hadoop gives us). A nice thing to have would be a slf4j
extension that sends the logs to ElasticSearch but this is a different
subject.

The question is: can we do all these things within a single plugin
extension?
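
To make the question concrete, here is a minimal sketch of what one generic
hook could look like. Everything in it is an assumption for discussion:
FetchListener, Dispatcher and the event type strings are hypothetical names,
not existing Nutch classes, and this is not Jake's prototype. Each plugin
receives typed events and picks the ones it cares about, and a failing
plugin is isolated so the fetch carries on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchHookSketch {

    /** Hypothetical extension point: one method, many kinds of events. */
    interface FetchListener {
        void onEvent(String type, Map<String, Object> data) throws Exception;
    }

    /** Calls every listener and keeps listener failures away from the fetch. */
    static class Dispatcher {
        private final List<FetchListener> listeners;
        private final AtomicInteger failures = new AtomicInteger();

        Dispatcher(List<FetchListener> listeners) {
            this.listeners = new ArrayList<>(listeners);
        }

        void dispatch(String type, Map<String, Object> data) {
            for (FetchListener listener : listeners) {
                try {
                    listener.onEvent(type, data);
                } catch (Exception e) {
                    // Count and move on: a broken plugin must not stop fetching.
                    failures.incrementAndGet();
                }
            }
        }

        int failures() {
            return failures.get();
        }
    }

    public static void main(String[] args) {
        // A metrics plugin: only interested in "metrics" events.
        FetchListener metrics = (type, data) -> {
            if (type.equals("metrics")) System.out.println("metrics " + data);
        };
        // A dissemination plugin: only interested in fetched pages.
        FetchListener pages = (type, data) -> {
            if (type.equals("fetched")) System.out.println("page " + data.get("url"));
        };
        // A plugin whose backend is down.
        FetchListener broken = (type, data) -> {
            throw new IllegalStateException("backend unreachable");
        };

        Dispatcher dispatcher = new Dispatcher(Arrays.asList(metrics, pages, broken));

        Map<String, Object> page = new LinkedHashMap<>();
        page.put("url", "http://example.com/");
        page.put("status", 200);
        dispatcher.dispatch("fetched", page);

        Map<String, Object> stats = new LinkedHashMap<>();
        stats.put("activeThreads", 42);
        dispatcher.dispatch("metrics", stats);

        System.out.println("listener failures: " + dispatcher.failures());
    }
}
```

Whether a single interface this loose is better than two narrower extension
points is exactly the open question; the sketch only shows that it is
feasible.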

Julien



On 18 June 2014 17:18, Jake Dodd <j...@ontopic.io> wrote:

> Hi Julien,
>
> Yep, you’re correct about the generation step being a limiting factor in
> getting new content in realtime—i.e. nearly as soon as it appears on the
> web. But that isn’t *quite* what I meant, so I’ll clarify what I mean by
> “realtime”, in the context of Nutch as it exists today: getting access to
> data as soon as it’s fetched, rather than waiting for the fetch job (and
> any subsequent jobs) to finish.
>
> The update and index steps would be called as usual, with no modifications
> to any existing Nutch workflows. In the prototype that I banged out
> yesterday, the dissemination occurs in the
> org.apache.nutch.fetcher.Fetcher.FetcherThread.output() method, after the
> output has been collected. More specifically, a Disseminator isn’t a
> substitute for an indexer. It simply makes online data about the fetch
> available to outside sources and services. The Disseminator would be
> invisible to people who choose not to use dissemination plugins.
>
> I can think of a couple of example use cases to illustrate. One use would
> be to create a Disseminator plugin that would collect fetch metadata for
> each URL (response code, content type, number of outlinks, host, domain,
> etc), format it as a Logstash event, and send it to an Elasticsearch
> cluster. A user could then use the Logstash/Kibana/Elasticsearch stack for
> detailed and highly visual monitoring of a Nutch crawl, with very little
> engineering involved, and no modification of the Nutch source—only
> development of a plugin. For smaller fetches, as Markus suggested, a
> Logstash “Indexer” could work for this; but for a longer fetch, it would be
> cool to monitor the crawl (not just the process, but the fetch data itself)
> in realtime without digging through Hadoop logs.
>
> A more advanced use case would be to disseminate the actual page content
> (either raw, or after parsing if parsing during fetch is enabled) to Apache
> Spark. From there, pages could be classified using SVM or a Bayes
> classifier in Spark’s MLlib. Once the fetch is finished, during indexing, a
> custom IndexingFilter could read the classifications—already generated—and
> filter indexing according to the classifications. If anybody has a
> classification-based IndexingFilter, this could greatly speed up their
> workflow.
>
> These are just off the top of my head—people are creative, I’m sure there
> are even cooler things someone could think up!
>
>  As you mentioned in the talk that you shared, doing everything
> continuously would be an enormous undertaking, requiring a major overhaul
> of Nutch and a migration from MR. But creating a plugin-based hook to the
> Fetcher seems to be relatively trivial.
>
> The storm-crawler project looks neat! We’ve contemplated building
> something similar that would reuse elements from Nutch where possible.
>
> Cheers
>
> Jake
>
> On Jun 18, 2014, at 1:34 AM, Julien Nioche <lists.digitalpeb...@gmail.com>
> wrote:
>
> Hi Jake
>
> Great to hear about your ideas. Sounds like what you are proposing would
> be only "near" realtime as much would depend on the generation which is a
> batch step. How / when would the update step be called? Would this be
> fetcher-only, i.e. without recursively discovering links? If so, why not go
> 100% real time, as done with something like
> https://github.com/DigitalPebble/storm-crawler?
>
> I hinted at similar things in my recent talks: see for instance
> http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide
> #40 (video at http://youtu.be/KyHPBtRlo80?t=42m). As mentioned in the
> talk above, I can see a hybrid model mixing batch and real time processes
> as a good solution. I recently looked at Apache Spark, which allows
> mixing batch, micro-batch and graph computation and sounded like a good
> framework for doing this, but I haven't had the time to go very far. In an
> ideal world, we'd be continuously fetching, parsing and updating, wouldn't
> we?
>
> Did I get your suggestion right?
>
> Julien
>
>
>
> On 17 June 2014 23:54, Jake Dodd <j...@ontopic.io> wrote:
>
>> Markus: The indexer plugin idea definitely works if the goal is only to
>> pass Nutch-collected data to realtime frameworks. However, there are some
>> cool things that you can do in “real” realtime (heh), as opposed to the
>> batch nature of Nutch’s indexing plugins and the FetcherOutputFormat.
>> Moreover, it would be cool to have Nutch working as designed (with
>> fetching, parsing, indexing and all) while basically gaining the realtime
>> capabilities for free.
>>
>> Chris: Glad to hear you’re interested, and thanks for the link! Today I
>> was actually able to finish a prototype version of this, along with two
>> example Disseminator plugins (one to stdout, the other to a Kafka
>> topic—both working beautifully). I’d be happy to create a New Feature JIRA
>> and start working on this.
>>
>> Cheers
>>
>> Jake
>>
>>
>>
>> On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>> Jake I am totally interested in this. Contributing to Nutch (and more
>> generally to Apache projects) is described really well (by Dennis Kubes)
>> here:
>>
>> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>>
>>
>> Looking forward to seeing your contributions!
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Markus Jelsma <markus.jel...@openindex.io>
>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>> Date: Tuesday, June 17, 2014 10:55 AM
>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>> Subject: RE: Nutch Extension for realtime processing
>>
>> Hi Jake,
>>
>> It would be more pluggable if you just implement an indexer backend
>> plugin for your target (storm, spark) so you can use the existing
>> indexing filtering framework and plugins to enrich the data. If you then
>> couple the indexing logic to FetcherOutputFormat, you can skip the parse
>> (because this requires a parsing fetcher) and updatedb jobs, as well as
>> the separate indexing job. This is certainly not real time but the delay
>> is much smaller, especially if you keep to (many) small fetch jobs. In
>> our environment we can guarantee a fetched document is always indexed
>> within 15 minutes.
>>
>> Markus
>>
>> -----Original message-----
>>
>> From:Jake Dodd <j...@ontopic.io>
>> Sent: Tuesday 17th June 2014 19:30
>> To: dev@nutch.apache.org
>> Subject: Nutch Extension for realtime processing
>>
>> Hi all,
>>
>> My organization is mulling the creation of a Nutch Extension Point that
>> would enable realtime processing of Nutch documents as they're fetched.
>> We have the desire to pass Nutch-fetched documents to a realtime
>> framework such as Storm or Spark. Currently, it's trivial to implement a
>> custom Indexer plugin that sort of gets the job done. However, this
>> doesn't really meet the realtime requirement: you must wait for the
>> fetch, parse, updatedb, index cycle to complete.
>>
>> Our idea is to create a FetcherDisseminator extension point. A
>> FetcherDisseminator would implement a disseminate() method that would
>> take care of serialization (JSON, Avro, etc) and disseminating the data
>> to an external entity (for example a REST interface, or a Kafka broker).
>>
>> The FetcherDisseminators would be called from within the
>> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
>> would be such that the normal fetch-parse-update-index cycle would be
>> unaffected, even in the case of disseminator failure.
>>
>> My first question is whether something like this has been discussed
>> before by the Nutch developers, and if so, if there is any current work
>> on the project.
>>
>> My second question is whether there is any interest from the community
>> in such a feature. If so, we'd love your input on how to go about
>> contributing to the Nutch project.
>>
>> Cheers
>>
>> Jake
>>
>>
>>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
