Merging segments causes URLs to vanish from crawldb/index?
----------------------------------------------------------

                 Key: NUTCH-1113
                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.3
            Reporter: Edward Drapkin


When I run Nutch, I use the following steps:

nutch inject crawldb/ url.txt

Then, repeated 3 times:

nutch generate crawldb/ segments/ -normalize
nutch fetch `ls -d segments/* | tail -1`
nutch parse `ls -d segments/* | tail -1`
nutch updatedb crawldb/ `ls -d segments/* | tail -1`

nutch mergesegs merged/ -dir segments/
nutch invertlinks linkdb/ -dir merged/

nutch index index/ crawldb/ linkdb/ -dir merged/

(The index command comes from the Lucene indexing code, which I forward-ported
from Nutch 1.1.)
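
For reference, the steps above could be wrapped in a small shell script along
these lines. This is only a sketch: the function names are made up, it assumes
nutch is on the PATH, it uses the stock updatedb command name for the crawldb
update step, and it leaves out the index step because that depends on the
forward-ported code.

```shell
#!/bin/sh
# Sketch of the crawl sequence from this report. The paths (crawldb/,
# segments/, merged/, linkdb/, url.txt) are the ones used above; the
# function names are hypothetical.

# Print the newest segment directory. Segment names are timestamps, so the
# last entry in sorted order is the most recent one.
latest_segment() {
    ls -d "$1"/* | tail -1
}

# One generate/fetch/parse/updatedb round against the newest segment.
crawl_round() {
    nutch generate crawldb/ segments/ -normalize || return 1
    seg=$(latest_segment segments)
    nutch fetch "$seg" &&
    nutch parse "$seg" &&
    nutch updatedb crawldb/ "$seg"
}

# Inject once, crawl three rounds, then merge segments and invert links.
full_crawl() {
    nutch inject crawldb/ url.txt || return 1
    for round in 1 2 3; do
        crawl_round || return 1
    done
    nutch mergesegs merged/ -dir segments/ &&
    nutch invertlinks linkdb/ -dir merged/
}
```

Calling full_crawl from an empty working directory that contains url.txt would
run the whole sequence; stopping before mergesegs and pointing invertlinks at
-dir segments/ instead gives the unmerged variant for comparison.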

When I crawl with merging segments, about 20% fewer URLs wind up in the index
than when I crawl without merging the segments.  Somehow the segment merger
causes me to lose ~20% of my crawl database!
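
One way to pin down exactly which URLs go missing might be to diff the URL
lists from the two runs. The sketch below assumes the URLs from each run have
already been exported to plain text files, one URL per line (how they are
exported is not shown here); the function names and file names are
hypothetical.

```shell
#!/bin/sh
# Print the URLs that appear in the first list but not in the second.
missing_urls() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"      # lines unique to the first (sorted) file
    rm -f "$a" "$b"
}

# Print the share of the first list that is absent from the second,
# as a whole-number percentage (integer division, rounds down).
loss_percent() {
    total=$(sort -u "$1" | wc -l | tr -d ' ')
    lost=$(missing_urls "$1" "$2" | wc -l | tr -d ' ')
    [ "$total" -gt 0 ] || return 1
    echo $(( lost * 100 / total ))
}
```

For example, missing_urls unmerged-urls.txt merged-urls.txt would list the
URLs that only the unmerged run produced, which may show whether the lost
entries share a pattern (same host, same fetch round, redirects and so on).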


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
