[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876426#comment-13876426 ]

Markus Jelsma commented on NUTCH-1113:
--------------------------------------

I have to reindex my control cluster segment by segment in chronological order because NUTCH-1706 was not enabled when I reindexed it last Friday. According to some test segments, that should decrease the size of the control cluster by properly deleting some redirects!

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)