Hi Sebastian,

Thank you for the very quick and detailed response. I've checked again and found 
that redirected URLs get lost if they were injected in the last iteration. 

Example: use http://www.atnf.csiro.au/observers/ as the seed and set depth to 1. 
It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched 
and parsed successfully, and then lost. If you set depth to 2, it will get 
indexed.

If you use http://www.atnf.csiro.au/observers/index.html as the seed, it will be 
fetched, parsed and indexed successfully even if you set depth to 1.
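To make the suspected cause concrete, here is a minimal, self-contained sketch of the key mismatch (plain Java, not actual Nutch code; the Parts class, its fields and the URL keys are just stand-ins for the values IndexerMapReduce collects per URL key):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectKeyMismatch {

    // Simplified stand-in for the per-URL values seen by the reducer.
    static class Parts {
        Object dbDatum, fetchDatum, parseText, parseData;

        // Mirrors the null check in IndexerMapReduce.reduce():
        // a document is skipped unless all four parts are present.
        boolean indexable() {
            return dbDatum != null && fetchDatum != null
                && parseText != null && parseData != null;
        }
    }

    public static void main(String[] args) {
        Map<String, Parts> byUrl = new HashMap<>();

        // The CrawlDb datum is emitted under the *original* (redirecting) URL ...
        Parts source = new Parts();
        source.dbDatum = "db";
        byUrl.put("http://www.atnf.csiro.au/observers/", source);

        // ... while fetch and parse output is emitted under the *final* URL.
        Parts target = new Parts();
        target.fetchDatum = "fetch";
        target.parseText = "text";
        target.parseData = "data";
        byUrl.put("http://www.atnf.csiro.au/observers/index.html", target);

        // Neither key has a complete set, so both records are skipped.
        for (Map.Entry<String, Parts> e : byUrl.entrySet()) {
            System.out.println(e.getKey() + " indexable: "
                + e.getValue().indexable());
        }
    }
}
```

If this is what happens, the fix would presumably be to emit the db datum under the redirect target's URL (or the fetch/parse output under the original URL) so that a single key receives a complete set.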

Regards,
Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Thursday, 29 October 2015 7:23 AM
> To: [email protected]
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied),
> 
> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> 
> 
> > fetches target content, parses and saves it, but loses it at the indexing 
> > stage.
> 
> Can you give a concrete example?
> 
> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> 
> 
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null
> >     || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> 
> This code snippet is correct; a redirect is pretty much the same as a link: 
> the crawler follows it. OK, there are many differences, but the central 
> point is: a link does not get indexed, only the link target. And that's the 
> same for redirects. There are always at least 2 URLs:
> - the source of the redirect
> - and the target of the redirection
> Only the latter gets indexed, after it has been fetched and provided it is 
> not a redirect itself.
> 
> The source has no parseText and parseData, and that's why it cannot be
> indexed.
> 
> If the target does not make it into the index:
> - first, check whether it passes URL filters and is not changed by normalizers
> - was it successfully fetched and parsed?
> - not excluded by robots=noindex?
> 
> You should check the CrawlDb and the segments for this URL.
> 
> If you could provide a concrete example, I'm happy to have a detailed look
> at it.
> 
> Cheers,
> Sebastian
> 
> 
> On 10/28/2015 08:57 AM, [email protected] wrote:
> > Hi,
> >
> > I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question
> > mark in the subject because I work with a Nutch modification called Arch (see
> > http://www.atnf.csiro.au/computing/software/arch/). This is why I am only
> > 99% sure that the same bug would occur in the original Nutch 1.9.
> >
> > In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> > applied), fetches the target content, parses and saves it, but loses it
> > at the indexing stage. This happens because the db datum is mapped
> > with the original URL as the key, but the fetch and parse data and
> > parse text are mapped with the final URL in IndexerMapReduce.
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null
> >     || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> >
> > I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
> > This is because I am not completely sure that this is a bug in Nutch (see
> > above), and also because what works for Arch may not work for Nutch; they
> > differ in their use of the crawl db.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> >
