[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880026#comment-13880026 ]
Markus Jelsma commented on NUTCH-1113:
--------------------------------------

I have tried running long sequences with random input, but nothing happened as you have seen. I haven't tried fetch_retry yet, but I don't think that will help much. I'll report back after I've fixed some other mess :)

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt,
> unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
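
For anyone trying to reproduce this, the steps quoted in the issue description can be sketched as a small script. This is only an illustration of the quoted sequence, not the reporter's actual script. The `NUTCH` variable, the dry-run default, and the `latest_segment` and `crawl` function names are assumptions introduced here. The subcommand names are copied verbatim from the report; a stock `bin/nutch` names the update step `updatedb`, and the `index` step relies on the reporter's forward-ported Lucene indexer, so adjust both for a real run.

```shell
#!/bin/sh
# Sketch of the crawl cycle quoted in the issue description.
# ASSUMPTION: NUTCH defaults to "echo nutch", so by default the script only
# prints the commands (dry run). Set NUTCH=bin/nutch to actually run them.
NUTCH=${NUTCH:-echo nutch}

# Print the most recent segment directory. Segment names are timestamps,
# so the last entry in lexical order is the newest. Prints nothing if
# segments/ does not exist yet (as in a dry run).
latest_segment() {
  ls -d segments/* 2>/dev/null | tail -1
}

crawl() {
  $NUTCH inject crawldb/ url.txt

  # Three generate/fetch/parse/update rounds, as in the report.
  for round in 1 2 3; do
    $NUTCH generate crawldb/ segments/ -normalize
    seg=$(latest_segment)
    $NUTCH fetch "$seg"
    $NUTCH parse "$seg"
    # The report writes "update"; stock bin/nutch calls this "updatedb".
    $NUTCH update crawldb "$seg"
  done

  # The merge step under suspicion, followed by link inversion and indexing.
  $NUTCH mergesegs merged/ -dir segments/
  $NUTCH invertlinks linkdb/ -dir merged/
  $NUTCH index index/ crawldb/ linkdb/ -dir merged/
}

crawl
```

To compare against the unmerged case described in the report, the same sketch could be run a second time with the `mergesegs` line removed and the last two commands pointed at `-dir segments/` instead of `-dir merged/`.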