After applying the patch I sent earlier, Nutch now correctly skips downloading pages that haven't changed. After running the generate/fetch/updatedb loop and merging the segments with mergesegs, dumping the merged segment shows that it still contains the old content as well as the new content. But when I then ran the invertlinks and index steps, the resulting index consists of much smaller files than the ones from the previous crawl, which suggests that only the newly fetched pages were indexed. I tried the NutchBean, and sure enough it could only find terms I knew were on the newly fetched pages; it couldn't find terms that occur hundreds of times on the unchanged pages. "merge" doesn't help either: the merged index comes out the same size as before merging.
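In case the exact commands matter, here's roughly the sequence I'm running (the directory names below are placeholders, not my real crawl layout):

    # one pass of the generate/fetch/updatedb loop
    bin/nutch generate crawl/crawldb crawl/segments
    segment=crawl/segments/`ls crawl/segments | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # merge all segments into a single new segment
    bin/nutch mergesegs crawl/merged -dir crawl/segments

    # dumping the merged segment shows both the old and the new content
    bin/nutch readseg -dump crawl/merged/`ls crawl/merged` dumpdir

    # invert links and index; this is where only the newly
    # fetched pages seem to survive
    bin/nutch invertlinks crawl/linkdb -dir crawl/merged
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/merged/*

    # merging the indexes afterwards doesn't change the result
    bin/nutch merge crawl/index-merged crawl/indexes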
Is there a way to fix this, or should I just admit that Nutch is hopelessly broken when it comes to avoiding refetches of unchanged pages, and back out my changes?

--
http://www.linkedin.com/in/paultomblin