Merging segments causes URLs to vanish from crawldb/index?
----------------------------------------------------------

                 Key: NUTCH-1113
                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.3
            Reporter: Edward Drapkin

When I run Nutch, I use the following steps:

nutch inject crawldb/ url.txt
repeated 3 times:
nutch generate crawldb/ segments/ -normalize
nutch fetch `ls -d segments/* | tail -1`
nutch parse `ls -d segments/* | tail -1`
nutch update crawldb `ls -d segments/* | tail -1`

nutch mergesegs merged/ -dir segments/
nutch invertlinks linkdb/ -dir merged/
nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene indexing code from Nutch 1.1.)

When I crawl with segment merging, about 20% fewer URLs wind up in the index than when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!
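
For anyone trying to reproduce this, the steps above can be sketched as a small shell function. This is only a sketch under assumptions: it takes "repeated 3 times" to cover the generate/fetch/parse/update steps only, with merge, invertlinks and index run once at the end; the NUTCH variable and the function names are hypothetical helpers, not part of Nutch; and the command names are copied from the report as written.

```shell
#!/bin/sh
# Sketch of the crawl sequence described in the report.
# Assumption: only generate/fetch/parse/update are repeated per round;
# mergesegs, invertlinks and index run once afterwards.

# Hypothetical override so the sequence can be dry-run, e.g. NUTCH=echo.
NUTCH="${NUTCH:-nutch}"

latest_segment() {
    # Segment directories are named by timestamp, so the last entry in
    # sorted order is the one most recently generated.
    ls -d segments/* | tail -1
}

crawl() {
    rounds="$1"

    $NUTCH inject crawldb/ url.txt || return 1

    i=0
    while [ "$i" -lt "$rounds" ]; do
        $NUTCH generate crawldb/ segments/ -normalize || return 1
        seg=$(latest_segment)
        $NUTCH fetch "$seg" || return 1
        $NUTCH parse "$seg" || return 1
        # Written as "update" in the report; a stock bin/nutch may name
        # this step "updatedb", so adjust if the command is not found.
        $NUTCH update crawldb "$seg" || return 1
        i=$((i + 1))
    done

    $NUTCH mergesegs merged/ -dir segments/ || return 1
    $NUTCH invertlinks linkdb/ -dir merged/ || return 1
    # The index command here relies on the reporter's forward-ported
    # Lucene indexing code.
    $NUTCH index index/ crawldb/ linkdb/ -dir merged/
}
```

Calling `crawl 3` would run the sequence as reported; setting `NUTCH=echo` first prints the commands instead of executing them, which is a cheap way to check that each round picks up the segment it should.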