Mathijs Homminga wrote:
> Hi everyone,
>
> Our crawler generates and fetches segments continuously. We'd like to
> index and merge each new segment immediately (or with a small delay)
> such that our index grows incrementally. This is unlike the normal
> situation, where one would create a linkdb and an index of all segments
> at once, after the crawl has finished.
>
> The problem we have is that Nutch currently needs the complete linkdb
> and crawldb each time we want to index a single segment.
The reason for wanting the linkdb is the anchor information. If you don't
need any anchor information, you can provide an empty linkdb.

The reason the crawldb is needed is to get the current page status
information (which may have changed in the meantime due to subsequent
crawldb updates from newer segments). If you don't need this information,
you can modify the Indexer.reduce() method (around line 212) to allow for
this, and then remove the line in Indexer.index() that adds the crawldb to
the list of input paths (see the sketch below).

> The Indexer map task processes all keys (urls) from the input files
> (linkdb, crawldb and segment). This includes all data from the linkdb
> and crawldb that we actually don't need, since we are only interested in
> the data that corresponds to the keys (urls) in our segment (this is
> filtered out in the Indexer reduce task).
> Obviously, as the linkdb and crawldb grow, this becomes more and more of
> a problem.

Is this really a problem for you now? Unless your segments are tiny, the
indexing process will be dominated by I/O from the processing of
parseText / parseData and by Lucene operations.

> Any ideas on how to tackle this issue?
> Is it feasible to look up the corresponding linkdb and crawldb data for
> each key (url) in the segment before or during indexing?

It would probably be too slow, unless you made a copy of the linkdb/crawldb
on the local filesystems of each node (see the lookup sketch below). But at
this point the benefit of this change would be doubtful, because of all the
I/O you would need to do to prepare each task's environment ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
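A rough sketch of the Indexer change described above, based on a 0.9-era
Nutch Indexer. The exact line numbers, field names (dbDatum, fetchDatum)
and the shape of the check may differ in other versions, so treat this as
an illustration rather than a patch:

    // Sketch only: approximate excerpts from a 0.9-era Indexer.

    // 1) In Indexer.index(), stop feeding the whole crawldb into the job
    //    (keep the linkdb line, or point it at an empty linkdb if you
    //    don't need anchors):
    //
    //    job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME)); // remove
    //    job.addInputPath(new Path(linkDb, LinkDb.CURRENT_NAME));   // keep

    // 2) In Indexer.reduce() (around line 212), relax the check that drops
    //    any URL for which no crawldb (dbDatum) entry was seen:
    //
    //    if (fetchDatum == null || dbDatum == null ||
    //        parseText == null || parseData == null) {
    //      return;   // original: requires a crawldb entry
    //    }
    //
    //    becomes something like:
    //
    //    if (fetchDatum == null || parseText == null || parseData == null) {
    //      return;   // index from the segment data alone
    //    }
    //    // ...and null-guard any later uses of dbDatum (e.g. status- or
    //    // score-based fields added to the Lucene document).

With the crawldb input removed, what reduce() loses is exactly the ability
to pick up status changes made by later updatedb runs, as noted above.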

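For completeness, a per-URL lookup of the kind Mathijs asks about could be
done with random reads against the linkdb MapFiles, much the way Nutch's
LinkDbReader does it. A minimal sketch, assuming a 0.9-era Nutch/Hadoop API
(class and path names are assumptions and may differ in other versions),
and keeping in mind the caveat above that doing this per document over DFS
is likely too slow:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.Partitioner;
    import org.apache.hadoop.mapred.lib.HashPartitioner;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.crawl.LinkDb;

    /** Random-access inlink lookups against linkdb/current. */
    public class InlinksLookup {
      private final MapFile.Reader[] readers;
      private final Partitioner partitioner = new HashPartitioner();

      public InlinksLookup(Configuration conf, Path linkDb) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // linkdb/current/part-NNNNN are MapFiles keyed by URL (Text -> Inlinks)
        readers = MapFileOutputFormat.getReaders(fs,
            new Path(linkDb, LinkDb.CURRENT_NAME), conf);
      }

      /** Returns the inlinks for a URL, or null if the linkdb has no entry. */
      public Inlinks getInlinks(Text url) throws IOException {
        return (Inlinks) MapFileOutputFormat.getEntry(
            readers, partitioner, url, new Inlinks());
      }
    }

Each call costs a seek into one MapFile partition, which is why this only
starts to make sense if the linkdb/crawldb live on each node's local disk
rather than on DFS.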