After applying the patch I sent earlier, Nutch now correctly skips
downloading pages that haven't changed. After running the
generate/fetch/updatedb cycle and merging the segments with mergesegs,
dumping the merged segment shows that it still contains the old
content as well as the new content. But when I then ran the
invertlinks and index steps, the resulting index consisted of very
small files compared to the ones from the previous crawl, which
suggests it only indexed what it had newly fetched. I tried the
NutchBean, and sure enough it could only find terms I knew were on the
newly fetched pages, and couldn't find terms that occur hundreds of
times on the pages that haven't changed. "merge" doesn't help either,
since the merged index is still the same size as before merging.
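
In case it helps, here is roughly what one cycle looks like. The
crawl/ paths and the 2* segment glob are just how my layout happens to
be set up, not anything canonical:

    # one recrawl cycle (paths are placeholders for my setup)
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # merge all segments; dumping the result shows old AND new content
    bin/nutch mergesegs crawl/merged -dir crawl/segments
    bin/nutch readseg -dump crawl/merged/* segdump

    # rebuild the linkdb and the index from the merged segment
    bin/nutch invertlinks crawl/linkdb -dir crawl/merged
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/merged/*

    # quick search sanity check (NutchBean reads from searcher.dir,
    # which defaults to "crawl")
    bin/nutch org.apache.nutch.searcher.NutchBean someterm

It's that last index step that produces the suspiciously small files.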

Is there a way to fix this, or should I just admit that Nutch is
hopelessly broken when it comes to avoiding refetching pages that
haven't changed, and back out my changes?

-- 
http://www.linkedin.com/in/paultomblin
