Hi everyone,

Our crawler generates and fetches segments continuously. We'd like to 
index and merge each new segment immediately (or with a small delay) 
so that our index grows incrementally. This is unlike the usual setup, 
where one creates a linkdb and an index of all segments at once, after 
the crawl has finished.

The problem we have is that Nutch currently needs the complete linkdb 
and crawldb each time we want to index a single segment.

The Indexer map task processes all keys (urls) from its input files 
(linkdb, crawldb, and segment). This means it also reads all the linkdb 
and crawldb data we don't actually need: we only care about the records 
matching the keys (urls) in our segment, and everything else is 
discarded in the Indexer reduce task.
Obviously, as the linkdb and crawldb grow, this becomes more and more of 
a problem.
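
To make this concrete, here is roughly how the Indexer wires up its job 
inputs today (a sketch from memory of Indexer.index(), so treat the 
exact constant names as approximate):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.LinkDb;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseText;

    public class IndexerInputs {
      /** Roughly how the Indexer adds its inputs (names from memory). */
      public static void addInputs(JobConf job, Path segment,
          Path crawlDb, Path linkDb) {
        // Per-segment data: only these keys (urls) end up in the index.
        job.addInputPath(new Path(segment, CrawlDatum.FETCH_DIR_NAME));
        job.addInputPath(new Path(segment, ParseData.DIR_NAME));
        job.addInputPath(new Path(segment, ParseText.DIR_NAME));

        // Whole-db inputs: every crawldb and linkdb record is read and
        // mapped, even though the reduce keeps only the urls that are
        // present in the segment.
        job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));
        job.addInputPath(new Path(linkDb, LinkDb.CURRENT_NAME));
      }
    }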

Any ideas on how to tackle this issue?
Would it be feasible to look up the corresponding linkdb and crawldb 
records for each key (url) in the segment before or during indexing?
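
Something like the following, perhaps (just a rough sketch; DbLookup is 
a made-up helper, and it assumes the crawldb/linkdb parts are MapFiles 
written with the default HashPartitioner, which I believe is the same 
trick CrawlDbReader uses for its per-url lookups):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.crawl.LinkDb;

    /** Random per-url lookups against crawldb/linkdb, no full scan. */
    public class DbLookup {

      private final MapFile.Reader[] crawlDbReaders;
      private final MapFile.Reader[] linkDbReaders;
      // The dbs are written with the default partitioner, so the same
      // partitioner tells us which part file holds a given url.
      private final HashPartitioner partitioner = new HashPartitioner();

      public DbLookup(FileSystem fs, Path crawlDb, Path linkDb,
          Configuration conf) throws IOException {
        crawlDbReaders = MapFileOutputFormat.getReaders(
            fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), conf);
        linkDbReaders = MapFileOutputFormat.getReaders(
            fs, new Path(linkDb, LinkDb.CURRENT_NAME), conf);
      }

      /** CrawlDatum for one url, or null if the url is unknown. */
      public CrawlDatum getCrawlDatum(Text url) throws IOException {
        return (CrawlDatum) MapFileOutputFormat.getEntry(
            crawlDbReaders, partitioner, url, new CrawlDatum());
      }

      /** Inlinks for one url, or null if none are recorded. */
      public Inlinks getInlinks(Text url) throws IOException {
        return (Inlinks) MapFileOutputFormat.getEntry(
            linkDbReaders, partitioner, url, new Inlinks());
      }
    }

The indexing job for a new segment could then call getCrawlDatum() and 
getInlinks() once per segment url, so the cost would scale with the 
segment size rather than with the size of the dbs. I'm not sure how 
this performs compared to the full scan once the segment gets large, 
though.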

Thanks!
Mathijs Homminga

-- 
Knowlogy
Helperpark 290 C
9723 ZA Groningen

[EMAIL PROTECTED]
+31 (0)6 15312977
http://www.knowlogy.nl


