Hi Sebastian,

Thank you for the very quick and detailed response. I've checked again and found 
that redirected URLs get lost if they were injected in the last iteration. 

Example: use http://www.atnf.csiro.au/observers/ as the seed and set depth to 1. 
It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched 
and parsed successfully, and then lost. If you set depth to 2, it will get 
indexed.

If you use http://www.atnf.csiro.au/observers/index.html as the seed, it will be 
fetched, parsed and indexed successfully even if you set depth to 1.
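To make the suspected cause concrete, here is a minimal, self-contained sketch of the key mismatch (plain Java, not actual Nutch code; the Parts class, its fields and the URL keys are just stand-ins for the values IndexerMapReduce collects per URL key):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectKeyMismatch {

    // Simplified stand-in for the per-URL values seen by the reducer.
    static class Parts {
        Object dbDatum, fetchDatum, parseText, parseData;

        // Mirrors the null check in IndexerMapReduce.reduce():
        // a document is skipped unless all four parts are present.
        boolean indexable() {
            return dbDatum != null && fetchDatum != null
                && parseText != null && parseData != null;
        }
    }

    public static void main(String[] args) {
        Map<String, Parts> byUrl = new HashMap<>();

        // The CrawlDb datum is emitted under the *original* (redirecting) URL ...
        Parts source = new Parts();
        source.dbDatum = "db";
        byUrl.put("http://www.atnf.csiro.au/observers/", source);

        // ... while fetch and parse output is emitted under the *final* URL.
        Parts target = new Parts();
        target.fetchDatum = "fetch";
        target.parseText = "text";
        target.parseData = "data";
        byUrl.put("http://www.atnf.csiro.au/observers/index.html", target);

        // Neither key has a complete set, so both records are skipped.
        for (Map.Entry<String, Parts> e : byUrl.entrySet()) {
            System.out.println(e.getKey() + " indexable: "
                + e.getValue().indexable());
        }
    }
}
```

If this is what happens, the fix would presumably be to emit the db datum under the redirect target's URL (or the fetch/parse output under the original URL) so that a single key receives a complete set.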

Regards,
Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Thursday, 29 October 2015 7:23 AM
> To: [email protected]
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied),
> 
> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> 
> 
> > fetches target content, parses and saves it, but loses it at the indexing 
> > stage.
> 
> Can you give a concrete example?
> 
> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> 
> 
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null
> >     || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> 
> This code snippet is correct; a redirect is pretty much the same as a link: 
> the crawler follows it. OK, there are many differences, but the central 
> point is: a link does not get indexed, only the link target. And that's the 
> same for redirects. There are always at least 2 URLs:
> - the source of the redirect
> - and the target of the redirection
> Only the latter gets indexed, after it has been fetched and provided it is 
> not a redirect itself.
> 
> The source has no parseText and parseData, and that's why it cannot be
> indexed.
> 
> If the target does not make it into the index:
> - first, check whether it passes URL filters and is not changed by normalizers
> - was it successfully fetched and parsed?
> - not excluded by robots=noindex?
> 
> You should check the CrawlDb and the segments for this URL.
> 
> If you could provide a concrete example, I'm happy to have a detailed look
> at it.
> 
> Cheers,
> Sebastian
> 
> 
> On 10/28/2015 08:57 AM, [email protected] wrote:
> > Hi,
> >
> > I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question
> > mark in the subject because I work with a Nutch modification called Arch (see
> > http://www.atnf.csiro.au/computing/software/arch/). This is why I am only
> > 99% sure that the same bug would occur in the original Nutch 1.9.
> >
> > In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> > applied), fetches the target content, parses and saves it, but loses it
> > at the indexing stage. This happens because the db datum is mapped
> > with the original URL as the key, but the fetch and parse data and
> > parse text are mapped with the final URL in IndexerMapReduce.
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null
> >     || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> >
> > I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
> > This is because I am not completely sure that this is a bug in Nutch (see
> > above), and also because what works for Arch may not work for Nutch; they
> > differ in their use of the crawl db.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> >
