Kai_testing Middleton wrote:
> I've been reviewing the four different merge commands (as of nutch v0.9):
> 
> $ nutch | grep merg
>   mergedb           merge crawldb-s, with optional filtering
>   mergesegs         merge several segments, with optional filtering and slicing
>   mergelinkdb       merge linkdb-s, with optional filtering
>   merge             merge several segment indexes
> 
> Here are the javadocs:
> mergedb -- 
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
> mergesegs -- 
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
> mergelinkdb -- 
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
> merge -- 
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
> 
> Naively: why are there four merge commands? Are some subsets of the others?  
> Are they used in conjunction? What are the usage scenarios of each?
> 
> I notice that Andrzej wrote the first three, and they have wiki entries 
> (pretty much the same as the javadoc):
> (I found these from http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg03588.html)
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
> It seems most of the nutch-user discussions I've seen so far relate to the 
> simple merge command.  Are the first three "advanced commands"?  
>

They serve different purposes. Let's assume that you've somehow ended up 
with two crawldb-s, e.g. you ran two crawls with different seed lists and 
different filters. Now you want to take these collections of URLs and 
create one big crawl. Then you would use mergedb to merge the crawldb-s, 
mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments ;)

And the simple "merge" merges the indexes of multiple segments, which is a 
performance-related step in the regular Nutch work cycle.
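
As a rough sketch of the scenario above, the script below only prints the 
commands it would run (a dry run), so nothing is executed against a real 
crawl. The directory names (crawlA, crawlB, merged) and the helper 
plan_merge are hypothetical, and the argument order (output first, then 
inputs) is an assumption based on the usage messages as of 0.9; running 
each command without arguments should print the actual usage for your 
version.

```shell
#!/bin/sh
# Dry-run sketch: prints the merge commands instead of running them.
# crawlA, crawlB and merged are hypothetical directory names.
NUTCH="${NUTCH:-bin/nutch}"

plan_merge() {
  a="$1"; b="$2"; out="$3"
  # Combine two crawls into one: crawldb-s, linkdb-s, then segments.
  # Assumed argument order: output directory first, inputs after it.
  echo "$NUTCH mergedb $out/crawldb $a/crawldb $b/crawldb"
  echo "$NUTCH mergelinkdb $out/linkdb $a/linkdb $b/linkdb"
  # -dir is assumed to take a parent directory containing segments.
  echo "$NUTCH mergesegs $out/segments -dir $a/segments -dir $b/segments"
  # Separate, routine step: merge the per-segment indexes into one index.
  echo "$NUTCH merge $out/index $out/indexes"
}

plan_merge crawlA crawlB merged
```

The first three lines are the "combine two crawls" case; the last line is 
the index merge, which assumes the merged segments have already been 
indexed into merged/indexes.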

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general