> On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
>> Have been thinking about this again. We could make it so that the
>> indexer does not necessarily require a linkDB: some people are not
>> particularly interested in getting the anchors. At the moment you have
>> to have a linkDB.
>>
>> This would make it a bit simpler (and quicker) to index within a crawl
>> iteration. Any thoughts on this?
>
> It still requires the CrawlDB right? Or are you suggesting we index
> without mapping through the CrawlDB?
My suggestion was only to make the linkDB optional to start with; it would
still require the crawldb.

> And at which point during the crawl cycle? The fetcher with parsing
> enabled?

It would still require the parsing to have been done (as part of fetching
or separately - it does not matter) and an updated crawldb, so after the
update. Think about it in the following way: the indexer remains as it is,
but people will be able to do without the linkdb if they don't need the
anchors. That's all.

> In that case, do we need url filtering and normalizing in the parse job?

See the other thread - we already have that.

> Anyway, take care of memory. The indexer can fill up your heap real
> quick, even with smaller add buffers. And of course handle indexing
> failure. It's not uncommon for requests to time out. And there is also
> the problem of an unhappily timed commit, which currently stops all
> indexing to Solr (there's an issue for this, I remember).

These are general issues with indexing, regardless of where it is called
(end of crawl vs end of loop).

Julien

> On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>> Thanks for the responses :)
>>>
>>> So the size of the segments then I guess would determine the latency
>>> between crawling and indexing.
>>
>> The size of your crawldb may matter even more in some cases. If your
>> segment has just one file and your crawldb many millions, the indexing
>> takes forever.
>>
>>> I and my colleague will look more into the scripts to see how the
>>> diffs get pushed to Solr.
>>> Thanks again
>>>
>>> M
>>
>> On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney
>> <lewis.mcgibb...@gmail.com> wrote:
>>> To add to Julien's comments, there was a contribution made by Gabriele
>>> a while ago which addressed this issue (however, I have not used his
>>> scripts extensively). They might be of interest for a look. Try the
>>> link below:
>>>
>>> http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
>>>
>>> On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche
>>> <lists.digitalpeb...@gmail.com> wrote:
>>>> Hi Matthew,
>>>>
>>>> This is usually achieved by writing a script containing the
>>>> individual Nutch commands (as opposed to calling 'nutch crawl') and
>>>> indexing at the end of a generate-fetch-parse-update-linkdb sequence.
>>>> You don't need any plugins for that.
>>>>
>>>> HTH
>>>>
>>>> Julien
>>>>
>>>> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com>
>>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> I was wondering about the feasibility of creating a plugin for
>>>>> Nutch that creates a Solr update command and adds it to a queue
>>>>> for indexing after it first parses the page, rather than when
>>>>> crawling has finished.
>>>>>
>>>>> This would allow you to do "real-time" indexing when crawling.
>>>>>
>>>>> Drawback: not able to use the graph to give relevancy information.
>>>>>
>>>>> Wondering what initial thoughts are about this?
>>>>> Thanks :)
>>>>
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>
>>> --
>>> Lewis
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
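For anyone following the thread, the scripted per-iteration sequence discussed above might look roughly like the sketch below. This is a minimal illustration, not a tested script: it assumes a Nutch 1.x layout (bin/nutch, crawl/crawldb, crawl/segments), the segment name and -topN value are placeholders, and NUTCH defaults to echo so running it as-is only prints the commands (a dry run) rather than invoking Nutch.

```shell
#!/bin/sh
set -e

# Dry-run by default: the commands are printed, not executed.
# In a real deployment, run with NUTCH=bin/nutch from the Nutch home dir.
NUTCH="${NUTCH:-echo bin/nutch}"

CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SEGMENT=$SEGMENTS/20110714120000   # placeholder: use the segment generate just created
SOLR_URL="http://localhost:8983/solr"

# One crawl iteration: generate a segment, fetch and parse it,
# then fold the results back into the crawldb.
$NUTCH generate $CRAWLDB $SEGMENTS -topN 1000
$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT              # or fetch with parsing enabled
$NUTCH updatedb $CRAWLDB $SEGMENT

# Anchor text requires the linkdb. Under the proposal in this thread,
# these linkdb steps could be skipped when anchors are not needed.
$NUTCH invertlinks $LINKDB -dir $SEGMENTS

# Index the freshly updated segment at the end of the iteration.
$NUTCH solrindex $SOLR_URL $CRAWLDB $LINKDB $SEGMENT
```

Indexing at the end of each iteration like this keeps the latency between crawling and indexing bounded by the segment size, as discussed above.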