On Friday 15 July 2011 11:07:36 Julien Nioche wrote:
> > On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
> > > Have been thinking about this again. We could make it so that the
> > > indexer does not necessarily require a linkDB: some people are not
> > > particularly interested in getting the anchors. At the moment you have
> > > to have a linkDB. This would make it a bit simpler (and quicker) to
> > > index within a crawl iteration. Any thoughts on this?
> >
> > It still requires the CrawlDB right? Or are you suggesting we index
> > without mapping through the CrawlDB?
>
> My suggestion was only to make the linkDB optional to start with; it
> would still require the crawldb.
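For context, the change under discussion would affect the indexer invocation roughly as follows. This is only a sketch: the first command reflects the Nutch 1.x `solrindex` usage of the time, while the second (linkdb-less) form is hypothetical, since the proposal had not been implemented when this was written.

```shell
# Behaviour at the time: the linkdb argument is mandatory,
# even if you don't need anchor text in the index.
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*

# Proposed behaviour (hypothetical syntax): index straight from the
# crawldb and segments, skipping the linkdb and anchors entirely.
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/segments/*
```

Skipping the linkdb matters for throughput because `invertlinks` is a full MapReduce pass over all segments in every crawl cycle.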
Not having to use (and create) a linkdb would be great in many cases.
Inverting links in every crawl cycle drops average throughput
considerably.

> > And at which point during the crawl cycle? The fetcher with parsing
> > enabled?
>
> It would still require the parsing to have been done (as part of
> fetching or separately - it does not matter) and an updated crawldb, so
> after the update. Think about it in the following way: the indexer
> remains as it is, but people will be able to do without the linkdb if
> they don't need the anchors. That's all.
>
> > In that case, do we need url filtering and normalizing in the parse
> > job?
>
> See the other thread - we already have that.
>
> > Anyway, take care of memory. The indexer can fill up your heap real
> > quick, even with smaller add buffers. And of course handle indexing
> > failures. It's not uncommon for requests to time out. And there is
> > also the problem of an unhappily timed commit, which currently stops
> > all indexing to Solr (there's an issue for this, I remember).
>
> These are general issues with indexing regardless of where it is called
> (end of crawl vs end of loop).
>
> Julien
>
> On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Thanks for the responses :)
> >
> > > > So the size of the segments then I guess would determine the
> > > > latency between crawling and indexing.
> > >
> > > The size of your crawldb may matter even more in some cases. If your
> > > segment has just one file and your crawldb many millions, the
> > > indexing takes forever.
> >
> > I and my colleague will look more into the scripts to see how the
> > diffs get pushed to Solr.
> > Thanks again
> >
> > M
> >
> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> > > To add to Julien's comments, there was a contribution made by
> > > Gabriele a while ago which addressed this issue (however I have not
> > > used his scripts extensively). They might be worth a look. Try the
> > > link below:
> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > >
> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <
> > > lists.digitalpeb...@gmail.com> wrote:
> > >> Hi Matthew,
> > >>
> > >> This is usually achieved by writing a script containing the
> > >> individual Nutch commands (as opposed to calling 'nutch crawl') and
> > >> indexing at the end of a generate-fetch-parse-update-linkdb
> > >> sequence. You don't need any plugins for that.
> > >>
> > >> HTH
> > >>
> > >> Julien
> > >>
> > >> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com>
> > >> wrote:
> > >>> Hi all,
> > >>>
> > >>> I was wondering about the feasibility of creating a plugin for
> > >>> nutch that creates a solr update command and adds it to a queue
> > >>> for indexing after it first parses the page, rather than when
> > >>> crawling has finished.
> > >>>
> > >>> This would allow you to do "real-time" indexing when crawling.
> > >>>
> > >>> Drawbacks: not able to use the graph to give relevancy
> > >>> information.
> > >>>
> > >>> Wondering what initial thoughts are about this?
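The script-based approach Julien describes can be sketched roughly as below. This is an illustrative reconstruction, not an official Nutch script: the paths, `-topN` value, depth, and Solr URL are all assumptions, and it mirrors the classic Nutch 1.x generate-fetch-parse-update-linkdb-index sequence.

```shell
#!/bin/sh
# Sketch of a Nutch 1.x crawl loop using individual commands instead of
# 'nutch crawl'. All paths and parameters here are illustrative.
CRAWL_DIR=crawl
SOLR_URL=http://localhost:8983/solr/
DEPTH=3

# Seed the crawldb with the URL list.
bin/nutch inject $CRAWL_DIR/crawldb urls

for i in $(seq 1 $DEPTH); do
  # Generate a fetch list, then pick the newest segment directory.
  bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 1000
  SEGMENT=$(ls -d $CRAWL_DIR/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  # Update the crawldb with fetch status and newly discovered links.
  bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT
done

# Invert links once at the end (the step the thread proposes making
# optional), then index everything to Solr in a single pass.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb \
    $CRAWL_DIR/segments/*
```

Moving `solrindex` inside the loop (after `updatedb`) is what "index within a crawl iteration" would mean; with the linkdb made optional, `invertlinks` could be dropped from that per-iteration cost.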
> > >>>
> > >>> Thanks :)
> > >>
> > >> --
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >
> > > --
> > > *Lewis*
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350