Re: [Nutch-general] four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

Dennis Kubes Wed, 18 Jul 2007 18:57:54 -0700

If I am reading the message right :) then yes that problem would have 
been fixed by now.  I believe that problem was with an earlier version 
of Nutch (0.7).


Dennis Kubes

Kai_testing Middleton wrote:
> Am I correct that the 'new' mergedb and mergelinkdb commands together would 
> fix this problem from April 2006?
> http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04112.html
> 
> Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
> Andrzej Bialecki
> Tue, 04 Apr 2006 09:29:51 -0700
> 
> Olive g wrote:
> Hi Andrzej & other gurus who might be reading this message :-):
> 
> I ran some tests and somehow my query returned 0 hits against merged 
> indexes. Here is my test case. It's a bit long, so thank you in 
> advance for your patience:
> 1. crawled the first 100 urls
> 
>   ~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 >& test1.log&
> 2. set searcher.dir to test1
> 
> 3. query for "movie"
> ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
> 
>  it returned 64 hits (a web search with Tomcat returned the same 
> result)
> 4. crawled the second 100 urls
> 
> ~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 >& test2.log&
> 5. set searcher.dir to test2
> 
> 6. query for "movie"
>  ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
>  it returned 55 hits (a web search with Tomcat returned the same 
> result)
> 7.  attempted to merge using the following command:
>  ../search/bin/nutch merge test3 test1 test2 >& merge-test3&
>  returned error:
>  Exception in thread "main" java.rmi.RemoteException: java.io.IOException: Cannot open filename /user/root/test1/crawldb/segments
>        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)
> 
> 8.  attempted to merge again using the following command:
> ../search/bin/nutch merge test4 test1/indexes test2/indexes >& merge-test4&
>   merged successfully with no errors
> 
> 9. set searcher.dir to test4
> 
> 10.  query for "movie" by:
>   ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
>  and it returned 0 hits (a web search with Tomcat returned the same 
> result)
>  060403 201545 10 opening segments in test4/segments
>  060403 201545 10 found resource common-terms.utf8 at file:/root/nutch/search/conf/common-terms.utf8
>  060403 201545 10 opening linkdb in test4/linkdb
>  Total hits: 0
> 
> It appeared to be looking for test4/segments and test4/linkdb which 
> did not exist?
> 
> Well, the short answer is that you cannot at the moment merge crawldbs 
> or linkdbs. As a consequence, you cannot use multiple outputs of 'nutch 
> crawl' together (because NutchBean needs to reference a single linkdb 
> during searching).
> This is technically possible, but simply not implemented (yet).
> 
> --
> Best regards,
> Andrzej Bialecki
> 
> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, July 16, 2007 1:59:39 PM
> Subject: Re: four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge
> 
> Hi
> 
> On 7/16/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
>> I've been reviewing the four different merge commands (as of nutch v0.9):
>>
>> $ nutch | grep merg
>>   mergedb           merge crawldb-s, with optional filtering
>>   mergesegs         merge several segments, with optional filtering and slicing
>>   mergelinkdb       merge linkdb-s, with optional filtering
>>   merge             merge several segment indexes
>>
>> Here are the javadocs:
>> mergedb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
>> mergesegs -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
>> mergelinkdb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
>> merge -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
>>
>> Naively: why are there four merge commands? Are some subsets of the others?  
>> Are they used in conjunction? What are the usage scenarios of each?
> 
> Each is used in a different scenario.
> mergedb: its name does not make this obvious, but it merges crawldbs. So
> think of it as mergecrawldb.
> 
> mergesegs: merges segments. It merges <segment>/{content,crawl_fetch,
> crawl_generate, crawl_parse, parse_data, parse_text} information from
> different segments.
> 
> merge: Merges Lucene indexes. After an index job, you end up with an
> indexes directory with a bunch of part-<num> directories inside.
> The merge command takes such a directory and produces a single index. A
> single index performs better (I think). You could say that
> merge is poorly named; it should have been called mergeindexes or
> something similar.
> 
> mergelinkdb: Should be obvious, merges linkdb-s.
> 
> So none of them is a subset of another. They all have different
> purposes. It is kind of confusing to have a "merge" command that only
> merges indexes, so perhaps we can add a mergeindexes command, keep
> merge for some time (noting that it has been deprecated) then remove
> it.
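
To make the four descriptions above concrete, here is a minimal dry-run sketch that prints one invocation of each command for two crawl directories. The names test1 and test2 come from the test case quoted earlier; bin/nutch, merged, and the run helper are placeholders of my own. The argument order (output first, then inputs) is an assumption to check against the usage message each command prints when run without arguments. The segment glob only expands on a local filesystem, and this is not a verified end-to-end recipe: whether per-crawl indexes stay usable after mergesegs, or need rebuilding, is not covered in this thread.

```shell
#!/bin/sh
# Dry-run sketch: print one invocation of each merge command for two
# crawl directories. All paths are placeholders, and the argument order
# is an assumption to verify against each command's usage message.
NUTCH=bin/nutch   # path to the nutch launcher script (assumption)
OUT=merged        # hypothetical output directory
DRY_RUN=1         # set to 0 to actually execute the commands

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "$@"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

merge_crawls() {
    a=$1
    b=$2
    # mergedb: combine the two crawldbs into one.
    run "$NUTCH" mergedb "$OUT/crawldb" "$a/crawldb" "$b/crawldb"
    # mergelinkdb: combine the two linkdbs into one.
    run "$NUTCH" mergelinkdb "$OUT/linkdb" "$a/linkdb" "$b/linkdb"
    # mergesegs: combine every segment found under both crawls
    # (the glob is expanded by the local shell, not by Hadoop DFS).
    run "$NUTCH" mergesegs "$OUT/segments" "$a"/segments/* "$b"/segments/*
    # merge: combine the part-<num> Lucene indexes into a single index.
    run "$NUTCH" merge "$OUT/index" "$a/indexes" "$b/indexes"
}

PLAN=$(merge_crawls test1 test2)
printf '%s\n' "$PLAN"
```

With DRY_RUN left at 1 nothing is executed, so the printed plan can be reviewed before touching any crawl data.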
> 
>> I notice that Andrzej wrote the first three, and they have wiki entries 
>> (pretty much the same as the javadoc):
>> (I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
>> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
>> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
>> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
>> It seems most of the nutch-user discussions I've seen so far relate to the 
>> simple merge command.  Are the first three "advanced commands"?
>>
