I followed the script (with minor variations) in the wiki at http://wiki.apache.org/nutch/Crawl; however, I think I found another bug. Apply this patch and the indexer will index pages with a status of STATUS_FETCH_NOTMODIFIED as well as STATUS_FETCH_SUCCESS. The patch also changes HttpResponse so that the If-Modified-Since header falls back to the fetch time when no modified time has been recorded, and so that the blank line terminating the request headers is appended after that header rather than before it.
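For anyone skimming, the indexer half of the patch boils down to the condition below, pulled out here as a standalone helper so the intent is easy to see. This is a paraphrase of what the diff does, not part of the patch itself; it only assumes the Nutch CrawlDatum constants are on the classpath. The actual diff follows.

import org.apache.nutch.crawl.CrawlDatum;

public class ShouldIndex {
  // Returns true when a record would survive the post-patch checks in
  // IndexerMapReduce.reduce(): the parse succeeded and the fetch ended in
  // either SUCCESS or NOTMODIFIED (previously only SUCCESS was accepted).
  static boolean shouldIndex(byte fetchStatus, boolean parseSucceeded) {
    return parseSucceeded
        && (fetchStatus == CrawlDatum.STATUS_FETCH_SUCCESS
            || fetchStatus == CrawlDatum.STATUS_FETCH_NOTMODIFIED);
  }
}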
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(revision 802632)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(working copy)
@@ -84,8 +84,10 @@
       if (CrawlDatum.hasDbStatus(datum))
         dbDatum = datum;
       else if (CrawlDatum.hasFetchStatus(datum)) {
-        // don't index unmodified (empty) pages
-        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+        /*
+         * Where did this person get the idea that unmodified pages are empty?
+        // don't index unmodified (empty) pages
+        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
           fetchDatum = datum;
       } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                  CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
       }
 
       if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
         return;
       }
 
Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(working copy)
@@ -124,11 +124,15 @@
       reqStr.append("\r\n");
     }
 
-    reqStr.append("\r\n");
     if (datum.getModifiedTime() > 0) {
       reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
       reqStr.append("\r\n");
     }
+    else if (datum.getFetchTime() > 0) {
+      reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+      reqStr.append("\r\n");
+    }
+    reqStr.append("\r\n");
 
     byte[] reqBytes= reqStr.toString().getBytes();

On Tue, Aug 11, 2009 at 5:35 AM, Alex McLintock<alex.mclint...@gmail.com> wrote:
> I've been wondering about this problem. When you did the invertlinks
> and index steps did you do it just on the current/most recent segment
> or all the segments?
>
> Presumably this is why you tried to do a merge?
>
> Alex
>
> 2009/8/10 Paul Tomblin <ptomb...@xcski.com>:
>> After applying the patch I sent earlier, I got it so that it correctly
>> skips downloading pages that haven't changed. And after doing the
>> generate/fetch/updatedb loop, and merging the segments with mergeseg,
>> dumping the segment file seems to show that it still has the old
>> content as well as the new content. But when I then ran the
>> invertlinks and index step, the resulting index consists of very small
>> files compared to the files from the previous crawl, indicating that
>> it only indexed the stuff that it had newly fetched. I tried the
>> NutchBean, and sure enough it could only find things I knew were on
>> the newly loaded pages, and couldn't find things that occur hundreds
>> of times on the pages that haven't changed. "merge" doesn't seem to
>> help, since the resulting merged index is still the same size as
>> before merging.
>

--
http://www.linkedin.com/in/paultomblin
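P.S. A quick way to confirm, outside of Nutch, that a given server actually honours If-Modified-Since (which is what the HttpResponse change relies on to get a 304 and hence STATUS_FETCH_NOTMODIFIED at all) is a standalone check along these lines; the URL and the one-day cutoff are only placeholders:

import java.net.HttpURLConnection;
import java.net.URL;

public class IfModifiedSinceCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder URL -- substitute a page from your own crawl.
    URL url = new URL("http://www.example.com/somepage.html");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Pretend the page was last fetched a day ago.
    conn.setIfModifiedSince(System.currentTimeMillis() - 24L * 60 * 60 * 1000);
    int code = conn.getResponseCode();
    // 304 means the server honours If-Modified-Since; 200 means it will
    // resend the full page even when nothing has changed.
    System.out.println("HTTP response code: " + code);
    conn.disconnect();
  }
}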