Hi Jake,

Great to hear about your ideas. It sounds like what you are proposing would be only "near" real time, as much would depend on the generation, which is a batch step. How and when would the update step be called? Would this be a fetcher only, i.e. one that does not recursively discover links? If so, why not go 100% real time, as done with something like https://github.com/DigitalPebble/storm-crawler?

I hinted at similar things in my recent talks: see for instance http://www.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013 slide #40 (video at http://youtu.be/KyHPBtRlo80?t=42m). As mentioned in that talk, I can see a hybrid model mixing batch and real-time processes as a good solution. I recently looked at Apache Spark, which sounded like a good framework for this since it allows mixing batch, micro-batch and graph computation, but I haven't had the time to go very far.

In an ideal world, we'd be continuously fetching, parsing and updating, wouldn't we? Did I get your suggestion right?

Julien

On 17 June 2014 23:54, Jake Dodd <j...@ontopic.io> wrote:

> Markus: The indexer plugin idea definitely works if the goal is only to
> pass Nutch-collected data to realtime frameworks. However, there are some
> cool things that you can do in "real" realtime (heh), as opposed to the
> batch nature of Nutch’s indexing plugins and the FetcherOutputFormat.
> Moreover, it would be cool to have Nutch working as designed (with
> fetching, parsing, indexing and all) while basically gaining the realtime
> capabilities for free.
>
> Chris: Glad to hear you’re interested, and thanks for the link! Today I
> was actually able to finish a prototype version of this, along with two
> example Disseminator plugins (one to stdout, the other to a Kafka
> topic, both working beautifully). I’d be happy to create a New Feature JIRA
> and start working on this.
>
> Cheers
>
> Jake
>
> On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Jake, I am totally interested in this. Contributing to Nutch (and more
> generally to Apache projects) is described really well (by Dennis Kubes)
> here:
>
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>
> Looking forward to seeing your contributions!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Date: Tuesday, June 17, 2014 10:55 AM
> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Subject: RE: Nutch Extension for realtime processing
>
> Hi Jake,
>
> It would be more pluggable if you just implement an indexer backend
> plugin for your target (storm, spark) so you can use the existing
> indexing filtering framework and plugins to enrich the data. If you then
> couple the indexing logic to FetcherOutputFormat, you can skip the parse
> (because this requires a parsing fetcher) and updatedb jobs, as well as
> the separate indexing job. This is certainly not real time but the delay
> is much smaller, especially if you keep to (many) small fetch jobs. In
> our environment we can guarantee a fetched document is always indexed
> within 15 minutes.
>
> Markus
>
> -----Original message-----
> From: Jake Dodd <j...@ontopic.io>
> Sent: Tuesday 17th June 2014 19:30
> To: dev@nutch.apache.org
> Subject: Nutch Extension for realtime processing
>
> Hi all,
>
> My organization is mulling the creation of a Nutch Extension Point that
> would enable realtime processing of Nutch documents as they're fetched.
> We have the desire to pass Nutch-fetched documents to a realtime
> framework such as Storm or Spark. Currently, it's trivial to implement a
> custom Indexer plugin that sort of gets the job done.
> However, this doesn't really meet the realtime requirement: you must
> wait for the fetch, parse, updatedb, index cycle to complete.
>
> Our idea is to create a FetcherDisseminator extension point. A
> FetcherDisseminator would implement a disseminate() method that would
> take care of serialization (JSON, Avro, etc) and disseminating the data
> to an external entity (for example a REST interface, or a Kafka broker).
>
> The FetcherDisseminators would be called from within the
> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
> would be such that the normal fetch-parse-update-index cycle would be
> unaffected, even in the case of disseminator failure.
>
> My first question is whether something like this has been discussed
> before by the Nutch developers, and if so, if there is any current work
> on the project.
>
> My second question is whether there is any interest from the community
> in such a feature. If so, we'd love your input on how to go about
> contributing to the Nutch project.
>
> Cheers
>
> Jake

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
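
For reference, the FetcherDisseminator idea described in Jake's original message could be sketched roughly as below. This is only an illustration under stated assumptions, not Jake's prototype and not Nutch code: the `disseminate(String, byte[])` signature, `StdoutDisseminator`, `DisseminatorChain` and `DisseminatorDemo` are all hypothetical names, and the real extension point would presumably receive Nutch's own content types and be invoked from `FetcherThread`. The point it shows is the isolation requirement: a failing disseminator is logged and counted, and never interrupts the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point (not part of Nutch): receives one fetched
// document and is responsible for serializing and sending it somewhere.
interface FetcherDisseminator {
    void disseminate(String url, byte[] content) throws Exception;
}

// Hypothetical example plugin: writes a one-line summary to stdout.
final class StdoutDisseminator implements FetcherDisseminator {
    @Override
    public void disseminate(String url, byte[] content) {
        System.out.println(url + "\t" + content.length + " bytes");
    }
}

// Hypothetical helper a fetcher thread could call after each fetch.
final class DisseminatorChain {
    private final List<FetcherDisseminator> plugins;

    DisseminatorChain(List<FetcherDisseminator> plugins) {
        this.plugins = new ArrayList<>(plugins);
    }

    // Calls every plugin in order. Exceptions are caught per plugin so the
    // normal fetch cycle is unaffected; returns how many plugins failed.
    int disseminateAll(String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator plugin : plugins) {
            try {
                plugin.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }
}

final class DisseminatorDemo {
    public static void main(String[] args) {
        List<FetcherDisseminator> plugins = new ArrayList<>();
        // Simulates an unreachable external system (e.g. a broker).
        plugins.add((url, content) -> {
            throw new IllegalStateException("broker unavailable");
        });
        plugins.add(new StdoutDisseminator());

        DisseminatorChain chain = new DisseminatorChain(plugins);
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        int failures = chain.disseminateAll("http://example.com/", page);
        System.out.println("failed disseminators: " + failures);
    }
}
```

In the demo the first plugin always throws, yet the second still receives the document, which is the behaviour the proposal asks for. A real implementation would probably also want timeouts or an asynchronous hand-off so that a slow external system cannot stall the fetcher threads.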