I have been thinking about this again. We could make it so that the indexer does not necessarily require a linkDB: some people are not particularly interested in getting the anchors. At the moment you have to have a linkDB.
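
For reference, this is roughly what indexing one segment looks like as things stand. It is only a sketch: I am assuming the usual Nutch 1.x command names (fetch, parse, updatedb, invertlinks, solrindex) and the current solrindex argument order, and the paths, the Solr URL and the `index_segment` name are made up for illustration. By default it is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the per-segment steps that follow 'bin/nutch generate'.
# Assumptions (not from this thread): directory layout, Solr URL, function name.
# NUTCH defaults to a dry run that prints each command; set NUTCH=bin/nutch to run them.
NUTCH="${NUTCH:-echo bin/nutch}"
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SOLR_URL=http://localhost:8983/solr/

index_segment() {
  segment="$1"   # path of the segment created by the generate step
  $NUTCH fetch "$segment" &&
  $NUTCH parse "$segment" &&
  $NUTCH updatedb "$CRAWLDB" "$segment" &&
  # This step is currently unavoidable, because the indexer expects a linkDB
  $NUTCH invertlinks "$LINKDB" "$segment" &&
  $NUTCH solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$segment"
}
```

If the linkDB were optional, the invertlinks call and the linkDB argument to solrindex could simply be left out for people who do not need the anchors.
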
Dropping that requirement would make it a bit simpler (and quicker) to index within a crawl iteration.

Any thoughts on this?

On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:

> > Thanks for the responses :)
> >
> > So the size of the segments then I guess would determine the latency
> > between crawling and indexing.
>
> The size of your crawldb may matter even more in some cases. If your segment
> has just one file and your crawldb many millions, the indexing takes
> forever.
>
> > I and my colleague will look more into the scripts to see how the diffs get
> > pushed to Solr.
> >
> > Thanks again
> >
> > M
> >
> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > > To add to Julien's comments there was a contribution made by Gabriele a
> > > while ago which addressed this issue (however I have not used his scripts
> > > extensively). They might be of interest for a look. Try the link below
> > >
> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > >
> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> > >> Hi Matthew,
> > >>
> > >> This is usually achieved by writing a script containing the individual
> > >> Nutch commands (as opposed to calling 'nutch crawl') and index at the
> > >> end of a generate-fetch-parse-update-linkdb sequence. You don't need
> > >> any plugins for that
> > >>
> > >> HTH
> > >>
> > >> Julien
> > >>
> > >> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com> wrote:
> > >>> Hi all,
> > >>>
> > >>> I was wondering about the feasibility of creating a plugin for Nutch
> > >>> that creates a Solr update command and adds it to a queue for
> > >>> indexing right after the page is first parsed, rather than when
> > >>> crawling has finished.
> > >>>
> > >>> This would allow you to do "real-time" indexing when crawling.
> > >>>
> > >>> Drawbacks: Not able to use the graph to give relevancy information.
> > >>>
> > >>> Wondering what initial thoughts are about this?
> > >>>
> > >>> Thanks :)
> > >>
> > >> --
> > >> *
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >
> > > --
> > > *Lewis*

--
*
*Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com