Hi Julien,

Yep, the statsd / Graphite / Librato example falls right in line with what I’m 
thinking!

I have some thoughts on how to generalize the FetcherDisseminator extension. In 
my prototype, I have DisseminatorDocument and DisseminatorField classes that, 
functionally, are similar to NutchDocument and NutchField. The primary 
differences are that there are no weights in the DisseminatorDocument, and it 
doesn’t implement Writable.
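
Roughly, here’s what the prototype looks like (a trimmed sketch; the names and 
fields are from my scratch code, nothing committed, and the ‘type’ is the idea 
I describe next):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the prototype classes: like NutchDocument/NutchField, but
    // with no per-field weights and no Writable plumbing.
    public class DisseminatorDocument {

      public enum Type { META, ITEM }

      private final Type type;
      private final Map<String, DisseminatorField> fields =
          new HashMap<String, DisseminatorField>();

      public DisseminatorDocument(Type type) { this.type = type; }

      public Type getType() { return type; }

      public void add(String name, Object value) {
        DisseminatorField field = fields.get(name);
        if (field == null) {
          field = new DisseminatorField();
          fields.put(name, field);
        }
        field.add(value);
      }

      public DisseminatorField getField(String name) { return fields.get(name); }
    }

    class DisseminatorField {
      private final List<Object> values = new ArrayList<Object>();
      public void add(Object value) { values.add(value); }
      public List<Object> getValues() { return values; }
    }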

A DisseminatorDocument could have a type that would be ‘meta’ or ‘item.’ Meta 
DisseminatorDocuments would contain fetcher meta/status information (number of 
queues, active threads, counts of items to be fetched, remaining, succeeded, 
failed, etc.). Item DisseminatorDocuments would contain data about individual 
items fetched (URL, content type, and the like).
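
Building those inside the fetcher might then look like this (the variable 
names here are stand-ins for the Fetcher’s internal counters, not the actual 
fields):

    // Hypothetical: a 'meta' document assembled from fetcher stats...
    DisseminatorDocument meta =
        new DisseminatorDocument(DisseminatorDocument.Type.META);
    meta.add("activeThreads", activeThreads);
    meta.add("queueCount", queueCount);
    meta.add("pagesFetched", pagesFetched);
    meta.add("errors", errors);

    // ...and an 'item' document for a single fetched URL.
    DisseminatorDocument item =
        new DisseminatorDocument(DisseminatorDocument.Type.ITEM);
    item.add("url", url.toString());
    item.add("contentType", contentType);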

Each FetcherDisseminator plugin would be required to identify as an item 
handler, meta handler, or both. A FetcherDisseminators class (nearly identical 
to the IndexingFilters class), within its disseminate(DisseminatorDocument doc) 
method, would then be responsible for passing the doc to each registered 
FetcherDisseminator plugin that has declared a matching handler type.
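
Sketched out (the PluginRepository lookup is elided, and I’m assuming the 
usual Nutch extension-point conventions here):

    import org.apache.hadoop.conf.Configurable;
    import org.apache.nutch.plugin.Pluggable;

    // Hypothetical extension point, following the pattern of IndexingFilter.
    public interface FetcherDisseminator extends Pluggable, Configurable {
      String X_POINT_ID = FetcherDisseminator.class.getName();

      // Declare which document types this plugin handles: META, ITEM, or both.
      boolean handles(DisseminatorDocument.Type type);

      void disseminate(DisseminatorDocument doc) throws Exception;
    }

    // Dispatcher, nearly identical in spirit to IndexingFilters. Populating
    // 'disseminators' from the plugin repository is omitted.
    public class FetcherDisseminators {
      private FetcherDisseminator[] disseminators;

      public void disseminate(DisseminatorDocument doc) {
        for (FetcherDisseminator d : disseminators) {
          if (d.handles(doc.getType())) {
            try {
              d.disseminate(doc);
            } catch (Exception e) {
              // A failing disseminator must never break the fetch; log and move on.
            }
          }
        }
      }
    }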

This way, a user wanting to create a FetcherDisseminator plugin would only have 
to specify the handler type and implement a disseminate() method, and not worry 
about customized handling for item data vs. fetcher status data. 
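
For example, a complete (if useless) stdout plugin comes down to a few lines; 
this is roughly what the one in my prototype looks like (Configurable 
boilerplate omitted):

    // A trivial item handler that just prints each fetched URL.
    public class StdoutDisseminator implements FetcherDisseminator {

      public boolean handles(DisseminatorDocument.Type type) {
        return type == DisseminatorDocument.Type.ITEM;
      }

      public void disseminate(DisseminatorDocument doc) {
        System.out.println("fetched: " + doc.getField("url").getValues());
      }
    }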

Cheers

Jake

On Jun 19, 2014, at 1:39 AM, Julien Nioche <lists.digitalpeb...@gmail.com> 
wrote:

> Hi Jake, 
> 
> Thanks for taking the time to explain what you want to do with the 
> Disseminator. It does make sense, but to be valuable it would have to be quite 
> generic so that people can put it to different uses. 
> 
> For instance, one thing I did for a customer was to hack the Fetcher so that 
> we send statistics to a monitoring tool, e.g. statsd / Graphite / Librato. 
> This way I can see the evolution of metrics during the fetching (active 
> threads / number of queues / number of URLs in queues / URLs fetched, IO, 
> etc.). We have these figures on the task UI, but it helps to see how they 
> evolve over time. The fetching step is the only place where it makes sense to 
> have this, as the other steps are quite linear in the way they work. Anyway, 
> would that fit with the Disseminator? Not sure; we could have yet another 
> plugin to do that.
> 
> The logging use case with Kibana is a good example: I did something 
> similar with StormCrawler (its logging mechanisms being a lot 
> worse than what Hadoop gives us). A nice thing to have would be an slf4j 
> extension that sends the logs to Elasticsearch, but this is a different 
> subject.
> 
> The question is: can we do all these things within a single plugin 
> extension? 
> 
> Julien
> 
> 
> 
> On 18 June 2014 17:18, Jake Dodd <j...@ontopic.io> wrote:
> Hi Julien,
> 
> Yep, you’re correct about the generation step being a limiting factor in 
> getting new content in realtime—i.e. nearly as soon as it appears on the web. 
> But that isn’t quite what I meant, so I’ll clarify what I mean by “realtime”, 
> in the context of Nutch as it exists today: getting access to data as soon as 
> it’s fetched, rather than waiting for the fetch job (and any subsequent jobs) 
> to finish.
> 
> The update and index steps would be called as usual, with no modifications to 
> any existing Nutch workflows. In the prototype that I banged out yesterday, 
> the dissemination occurs in the 
> org.apache.nutch.fetcher.Fetcher.FetcherThread.output() method, after the 
> output has been collected. More specifically, a Disseminator isn’t a 
> substitute for an indexer. It simply makes online data about the fetch 
> available to outside sources and services. The Disseminator would be 
> invisible to people who choose not to use dissemination plugins.
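> 
> Schematically, the hook is just a few guarded lines at the end of output() 
> (a sketch only; buildItemDoc() is a hypothetical helper, not actual Nutch 
> code):
> 
>     // At the end of Fetcher.FetcherThread.output(), once the output has
>     // been collected.
>     DisseminatorDocument doc = buildItemDoc(key, datum, content);
>     try {
>       disseminators.disseminate(doc);
>     } catch (Throwable t) {
>       // Dissemination must never affect the normal fetch output path.
>       LOG.warn("Disseminator failed for " + key, t);
>     }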
> 
> I can think of a couple of example use cases to illustrate. One use would be 
> to create a Disseminator plugin that would collect fetch metadata for each 
> URL (response code, content type, number of outlinks, host, domain, etc.), 
> format it as a Logstash event, and send it to an Elasticsearch cluster. A 
> user could then use the Logstash/Kibana/Elasticsearch stack for detailed and 
> highly visual monitoring of a Nutch crawl, with very little engineering 
> involved, and no modification of the Nutch source—only development of a 
> plugin. For smaller fetches, as Markus suggested, a Logstash “Indexer” could 
> work for this; but for a longer fetch, it would be cool to monitor the crawl 
> (not just the process, but the fetch data itself) in realtime without digging 
> through Hadoop logs.
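> 
> The plugin side of that is mostly event formatting, along these lines 
> (toJson(), sendToElasticsearch() and the iso8601 date formatter are 
> placeholders for whichever JSON library and ES client one picks):
> 
>     // Item handler that turns a single fetch into a Logstash-style event.
>     public void disseminate(DisseminatorDocument doc) {
>       Map<String, Object> event = new HashMap<String, Object>();
>       event.put("@timestamp", iso8601.format(new Date()));
>       event.put("url", doc.getField("url").getValues().get(0));
>       event.put("status", doc.getField("responseCode").getValues().get(0));
>       event.put("contentType", doc.getField("contentType").getValues().get(0));
>       sendToElasticsearch(toJson(event));
>     }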
> 
> A more advanced use case would be to disseminate the actual page content 
> (either raw, or parsed if parsing during fetch is enabled) to Apache 
> Spark. From there, pages could be classified using SVM or a Bayes classifier 
> in Spark’s MLlib. Once the fetch is finished, during indexing, a custom 
> IndexingFilter could read the classifications—already generated—and filter 
> indexing according to the classifications. If anybody has a 
> classification-based IndexingFilter, this could greatly speed up their 
> workflow.
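> 
> On the indexing side, the custom filter would only need a lookup (a sketch; 
> classificationStore and wantedLabels are placeholders for wherever the Spark 
> job writes its results):
> 
>     // IndexingFilter that consults classifications precomputed during the
>     // fetch; returning null drops the page from the index.
>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>         CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>       String label = classificationStore.get(url.toString());
>       if (label == null || !wantedLabels.contains(label)) {
>         return null;
>       }
>       doc.add("classification", label);
>       return doc;
>     }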
> 
> These are just off the top of my head—people are creative, I’m sure there are 
> even cooler things someone could think up!  
> 
>  As you mentioned in the talk that you shared, doing everything continuously 
> would be an enormous undertaking, requiring a major overhaul of Nutch and a 
> migration from MR. But creating a plugin-based hook to the Fetcher seems to 
> be relatively trivial.
> 
> The storm-crawler project looks neat! We’ve contemplated building something 
> similar that would reuse elements from Nutch where possible.
> 
> Cheers
> 
> Jake
> 
> On Jun 18, 2014, at 1:34 AM, Julien Nioche <lists.digitalpeb...@gmail.com> 
> wrote:
> 
>> Hi Jake
>> 
>> Great to hear about your ideas. Sounds like what you are proposing would be 
>> only "near" realtime, as much would depend on the generation, which is a 
>> batch step. How / when would the update step be called? Would this be 
>> fetcher-only, i.e. not recursively discovering links? If so, why not go 100% 
>> real time, as done with something like 
>> https://github.com/DigitalPebble/storm-crawler?
>> 
>> I hinted at similar things in my recent talks: see for instance 
>> http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide #40 
>> (video on http://youtu.be/KyHPBtRlo80?t=42m). As mentioned in the talk 
>> above, I can see a hybrid model mixing batch and real time processes as a 
>> good solution. I recently looked at Apache Spark, which allows mixing batch, 
>> micro-batch and graph computation; it sounded like a good framework for 
>> doing this, but I haven't had the time to go very far. In an ideal world, 
>> we'd be continuously fetching, parsing and updating, wouldn't we?
>> 
>> Did I get your suggestion right?
>> 
>> Julien
>> 
>> 
>> 
>> On 17 June 2014 23:54, Jake Dodd <j...@ontopic.io> wrote:
>> Markus: The indexer plugin idea definitely works if the goal is only to pass 
>> Nutch-collected data to realtime frameworks. However, there are some cool 
>> things that you can do in “real" realtime (heh), as opposed to the batch 
>> nature of Nutch’s indexing plugins and the FetcherOutputFormat. Moreover, it 
>> would be cool to have Nutch working as designed (with fetching, parsing, 
>> indexing and all) while basically gaining the realtime capabilities for free.
>> 
>> Chris: Glad to hear you’re interested, and thanks for the link! Today I was 
>> actually able to finish a prototype version of this, along with two example 
>> Disseminator plugins (one to stdout, the other to a Kafka topic—both working 
>> beautifully). I’d be happy to create a New Feature JIRA and start working on 
>> this.
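>> 
>> For the curious, the Kafka plugin’s core is just a couple of lines around 
>> the 0.8-era producer API (config and serialization trimmed; toJson() is a 
>> placeholder):
>> 
>>     // 'producer' and 'topic' are set up from the plugin's configuration.
>>     public void disseminate(DisseminatorDocument doc) {
>>       producer.send(new KeyedMessage<String, String>(topic, toJson(doc)));
>>     }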
>> 
>> Cheers
>> 
>> Jake
>> 
>> 
>> 
>> On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980) 
>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>> 
>>> Jake I am totally interested in this. Contributing to Nutch (and more
>>> generally to Apache projects) is described really well (by Dennis Kubes)
>>> here:
>>> 
>>> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>>> 
>>> 
>>> Looking forward to seeing your contributions!
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> -----Original Message-----
>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Date: Tuesday, June 17, 2014 10:55 AM
>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Subject: RE: Nutch Extension for realtime processing
>>> 
>>>> Hi Jake,
>>>> 
>>>> It would be more pluggable if you just implement an indexer backend
>>>> plugin for your target (storm, spark) so you can use the existing
>>>> indexing filtering framework and plugins to enrich the data. If you then
>>>> couple the indexing logic to FetcherOutputFormat, you can skip the parse
>>>> (because this requires a parsing fetcher) and updatedb jobs, as well as
>>>> the separate indexing job. This is certainly not real time but the delay
>>>> is much smaller, especially if you keep to (many) small fetch jobs. In
>>>> our environment we can guarantee a fetched document is always indexed
>>>> within 15 minutes.
>>>> 
>>>> Markus 
>>>> 
>>>> -----Original message-----
>>>>> From:Jake Dodd <j...@ontopic.io>
>>>>> Sent: Tuesday 17th June 2014 19:30
>>>>> To: dev@nutch.apache.org
>>>>> Subject: Nutch Extension for realtime processing
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> My organization is mulling the creation of a Nutch Extension Point that
>>>>> would enable realtime processing of Nutch documents as they’re fetched.
>>>>> We have the desire to pass Nutch-fetched documents to a realtime
>>>>> framework such as Storm or Spark. Currently, it’s trivial to implement a
>>>>> custom Indexer plugin that sort of gets the job done. However, this
>>>>> doesn’t really meet the realtime requirement—you must wait for the
>>>>> fetch, parse, updatedb, index cycle to complete.
>>>>> 
>>>>> Our idea is to create a FetcherDisseminator extension point. A
>>>>> FetcherDisseminator would implement a disseminate() method that would
>>>>> take care of serialization (JSON, Avro, etc) and disseminating the data
>>>>> to an external entity (for example a REST interface, or a Kafka broker).
>>>>> 
>>>>> The FetcherDisseminators would be called from within the
>>>>> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
>>>>> would be such that the normal fetch-parse-update-index cycle would be
>>>>> unaffected, even in the case of disseminator failure.
>>>>> 
>>>>> My first question is whether something like this has been discussed
>>>>> before by the Nutch developers, and if so, if there is any current work
>>>>> on the project.
>>>>> 
>>>>> My second question is whether there is any interest from the community
>>>>> in such a feature. If so, we¹d love your input on how to go about
>>>>> contributing to the Nutch project.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Jake
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
> 
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
