Hi,
I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question mark in
the subject because I work with a Nutch modification called Arch (see
http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99%
sure that the same bug would occur in the original Nutch 1.9.
In my experience, Nutch follows redirects OK (once NUTCH-2124 is applied),
fetches the target content, parses and saves it, but then loses it at the
indexing stage. This happens because, in IndexerMapReduce, the db datum is
mapped with the original URL as the key, while the fetch datum, parse data and
parse text are mapped with the final URL. Therefore, when this condition is checked
    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return; // only have inlinks
    }
both sets get ignored because each one is incomplete.
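
To make the mismatch concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class, the Rec and Kind types and the example URLs are all made up for illustration, and it assumes only that records are grouped by URL key before the check above runs. The db datum is keyed by the original URL and the other three records by the redirect target, so neither key ends up with all four pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; none of these names exist in Nutch.
public class RedirectKeyMismatch {

  // The four record kinds the quoted check requires for a single key.
  enum Kind { DB_DATUM, FETCH_DATUM, PARSE_DATA, PARSE_TEXT }

  // Hypothetical stand-in for one mapped record: a URL key plus the kind of value.
  static final class Rec {
    final String key;
    final Kind kind;
    Rec(String key, Kind kind) { this.key = key; this.kind = kind; }
  }

  // Groups records by key and returns the keys that carry all four kinds,
  // i.e. the keys that would get past the null check.
  static List<String> indexableKeys(List<Rec> records) {
    Map<String, Set<Kind>> byKey = new LinkedHashMap<>();
    for (Rec r : records) {
      byKey.computeIfAbsent(r.key, k -> EnumSet.noneOf(Kind.class)).add(r.kind);
    }
    List<String> complete = new ArrayList<>();
    for (Map.Entry<String, Set<Kind>> e : byKey.entrySet()) {
      if (e.getValue().containsAll(EnumSet.allOf(Kind.class))) {
        complete.add(e.getKey());
      }
    }
    return complete;
  }

  public static void main(String[] args) {
    String original = "http://example.org/old"; // made-up original URL
    String target = "http://example.org/new";   // made-up redirect target
    List<Rec> redirected = Arrays.asList(
        new Rec(original, Kind.DB_DATUM),
        new Rec(target, Kind.FETCH_DATUM),
        new Rec(target, Kind.PARSE_DATA),
        new Rec(target, Kind.PARSE_TEXT));
    // Neither key has all four kinds, so nothing would be indexed.
    System.out.println(indexableKeys(redirected)); // prints []
  }
}
```

If all four records shared one key (for example, if the db datum were also keyed by the final URL), that key would be returned. Whether that is the right fix for Nutch itself is a separate question.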
I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. This
is because I am not completely sure that this is a bug in Nutch (see above), and
also because what works for Arch may not work for Nutch. They differ in how
they use the crawl db.
Regards,
Arkadi