[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869441#comment-13869441 ]
Markus Jelsma commented on NUTCH-1113:
--------------------------------------

Another record is also missing:

{code}
Segment: 20131219031127
Segment: Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Dec 19 03:16:38 UTC 2013
Modified time: Mon Feb 11 14:12:46 UTC 2013
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: e296447f874bb33ad68c23b5db06750e
Metadata: _ngt_=1387422589128 hubpage=0.032560557 numOutlinks=291 Content-Type=application/xhtml+xml _pst_=success(1), lastModified=0 adult=0.008212954

Segment: 20131230082800
Segment: Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 30 08:31:17 UTC 2013
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: null
Metadata: _ngt_=1388391979616 Content-Type=text/html _pst_=moved(12), lastModified=0: http://www.example.org/beleid/water-milieu-en-veiligheid/ _repr_=http://www.example.org/beleid/water-milieu-en-veiligheid/
{code}

This record is only indexed by the segment without LINKED, NUTCH-1616 and trunk, but not by NUTCH-1113 or Sebastian's patch.

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)