Re: Nutch Extension for realtime processing

Julien Nioche Wed, 18 Jun 2014 01:36:08 -0700

Hi Jake

Great to hear about your ideas. It sounds like what you are proposing would be
only "near" realtime, as much would depend on the generation, which is a
batch step. How / when would the update step be called? Would this be a
fetcher only, i.e. one that does not recursively discover links? If so, why
not go 100% real time, as done with something like
https://github.com/DigitalPebble/storm-crawler?


I hinted at similar things in my recent talks: see for instance
http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide #40
(video at http://youtu.be/KyHPBtRlo80?t=42m). As mentioned in the talk above,
I can see a hybrid model mixing batch and real-time processes as a good
solution. I recently looked at Apache Spark, as it allows mixing batch,
micro-batch and graph computation and sounded like a good framework for doing
this, but I haven't had the time to go very far. In an ideal world, we'd be
continuously fetching, parsing and updating, wouldn't we?

Did I get your suggestion right?

Julien



On 17 June 2014 23:54, Jake Dodd <j...@ontopic.io> wrote:

> Markus: The indexer plugin idea definitely works if the goal is only to
> pass Nutch-collected data to realtime frameworks. However, there are some
> cool things that you can do in "real" realtime (heh), as opposed to the
> batch nature of Nutch’s indexing plugins and the FetcherOutputFormat.
> Moreover, it would be cool to have Nutch working as designed (with
> fetching, parsing, indexing and all) while basically gaining the realtime
> capabilities for free.
>
> Chris: Glad to hear you’re interested, and thanks for the link! Today I
> was actually able to finish a prototype version of this, along with two
> example Disseminator plugins (one to stdout, the other to a Kafka
> topic—both working beautifully). I’d be happy to create a New Feature JIRA
> and start working on this.
>
> Cheers
>
> Jake
>
>
>
> On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Jake I am totally interested in this. Contributing to Nutch (and more
> generally to Apache projects) is described really well (by Dennis Kubes)
> here:
>
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>
>
> Looking forward to seeing your contributions!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Date: Tuesday, June 17, 2014 10:55 AM
> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Subject: RE: Nutch Extension for realtime processing
>
> Hi Jake,
>
> It would be more pluggable if you just implement an indexer backend
> plugin for your target (storm, spark) so you can use the existing
> indexing filtering framework and plugins to enrich the data. If you then
> couple the indexing logic to FetcherOutputFormat, you can skip the parse
> (because this requires a parsing fetcher) and updatedb jobs, as well as
> the separate indexing job. This is certainly not real time but the delay
> is much smaller, especially if you keep to (many) small fetch jobs. In
> our environment we can guarantee a fetched document is always indexed
> within 15 minutes.
>
> Markus
>
> -----Original message-----
>
> From: Jake Dodd <j...@ontopic.io>
> Sent: Tuesday 17th June 2014 19:30
> To: dev@nutch.apache.org
> Subject: Nutch Extension for realtime processing
>
> Hi all,
>
> My organization is mulling the creation of a Nutch Extension Point that
> would enable realtime processing of Nutch documents as they’re fetched.
> We have the desire to pass Nutch-fetched documents to a realtime
> framework such as Storm or Spark. Currently, it’s trivial to implement a
> custom Indexer plugin that sort of gets the job done. However, this
> doesn’t really meet the realtime requirement: you must wait for the
> fetch, parse, updatedb, index cycle to complete.
>
> Our idea is to create a FetcherDisseminator extension point. A
> FetcherDisseminator would implement a disseminate() method that would
> take care of serialization (JSON, Avro, etc) and disseminating the data
> to an external entity (for example a REST interface, or a Kafka broker).
>
> The FetcherDisseminators would be called from within the
> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
> would be such that the normal fetch-parse-update-index cycle would be
> unaffected, even in the case of disseminator failure.
>
> My first question is whether something like this has been discussed
> before by the Nutch developers, and if so, if there is any current work
> on the project.
>
> My second question is whether there is any interest from the community
> in such a feature. If so, we’d love your input on how to go about
> contributing to the Nutch project.
>
> Cheers
>
> Jake
>
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
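
For illustration only, the FetcherDisseminator extension point described in
Jake's original message could look roughly like the Java sketch below. Only
the name FetcherDisseminator and the disseminate() method come from the
thread. The method parameters, StdoutDisseminator and DisseminatorChain are
assumptions made for this example; they are neither Nutch APIs nor Jake's
prototype. The point it tries to show is the failure isolation he mentions:
a failing disseminator is logged and counted, and the caller carries on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point. The name and the disseminate() method come
// from the proposal; the (url, content) parameters are assumed here.
interface FetcherDisseminator {
    void disseminate(String url, byte[] content) throws Exception;
}

// Example plugin: prints a one-line summary of each fetched document.
class StdoutDisseminator implements FetcherDisseminator {
    @Override
    public void disseminate(String url, byte[] content) {
        System.out.println(url + " (" + content.length + " bytes)");
    }
}

// Hypothetical helper that a fetcher thread could call after each fetch.
class DisseminatorChain {
    private final List<FetcherDisseminator> disseminators = new ArrayList<>();

    void add(FetcherDisseminator d) {
        disseminators.add(d);
    }

    // Calls every registered disseminator and returns how many failed.
    // Exceptions are caught and logged so that the caller's normal
    // fetch-parse-update-index cycle is not interrupted.
    int disseminateAll(String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }
}
```

A Kafka or REST disseminator would implement the same interface and handle
its own serialization (JSON, Avro, etc.) inside disseminate().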
