Bug: redirected URLs lost on indexing stage?

Arkadi.Kosmynin Wed, 28 Oct 2015 00:58:52 -0700

Hi,

I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question mark in
the subject because I work with a Nutch modification called Arch (see
http://www.atnf.csiro.au/computing/software/arch/), so I am only 99%
sure that the same bug occurs in the original Nutch 1.9.


In my experience, Nutch follows redirects OK (once NUTCH-2124 is applied): it
fetches the target content, parses it and saves it, but then loses it at the
indexing stage. This happens because, in IndexerMapReduce, the db datum is
mapped with the original URL as the key, while the fetch datum, parse data and
parse text are mapped with the final URL. Therefore, when this condition is checked

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

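both sets get ignored because each one is incomplete: the set keyed by the
original URL has only the db datum, and the set keyed by the final URL has
everything except the db datum.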
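
The key mismatch described above can be reproduced in a small stand-alone sketch. This is not Nutch code: the class RedirectIndexingSketch, its methods, the Kind enum and the example.org URLs are all hypothetical, and the grouping is a simplified model of records being collected per URL key before the null check runs.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of the grouping step; not taken from Nutch.
class RedirectIndexingSketch {

    // The four record kinds that the quoted check requires under one key.
    enum Kind { DB_DATUM, FETCH_DATUM, PARSE_DATA, PARSE_TEXT }

    // Record that a value of the given kind was emitted under the given URL key.
    static void emit(Map<String, EnumSet<Kind>> groups, String url, Kind kind) {
        groups.computeIfAbsent(url, k -> EnumSet.noneOf(Kind.class)).add(kind);
    }

    // Model the situation from the post: the db datum is keyed by the
    // original URL, everything else by the final (redirect target) URL.
    static Map<String, EnumSet<Kind>> redirectCase(String original, String target) {
        Map<String, EnumSet<Kind>> groups = new TreeMap<>();
        emit(groups, original, Kind.DB_DATUM);
        emit(groups, target, Kind.FETCH_DATUM);
        emit(groups, target, Kind.PARSE_DATA);
        emit(groups, target, Kind.PARSE_TEXT);
        return groups;
    }

    // Keys that would pass the check, i.e. keys holding all four record kinds.
    static List<String> indexableKeys(Map<String, EnumSet<Kind>> groups) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, EnumSet<Kind>> entry : groups.entrySet()) {
            if (entry.getValue().containsAll(EnumSet.allOf(Kind.class))) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Redirected page: the records are split across two keys.
        System.out.println(indexableKeys(
            redirectCase("http://example.org/old", "http://example.org/new")));
        // No redirect: original and final URL coincide.
        System.out.println(indexableKeys(
            redirectCase("http://example.org/page", "http://example.org/page")));
    }
}
```

In this model, a redirected page yields no indexable key, while a page whose original and final URLs coincide yields exactly one.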

I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. This
is because I am not completely sure that this is a bug in Nutch (see above), and
also because what works for Arch may not work for Nutch: the two differ
in how they use the crawl db.

Regards,

Arkadi
