Hi Sebastian,

Thank you for the very quick and detailed response. I've checked again and found that redirected URLs get lost if they were injected in the last iteration.
Example: use http://www.atnf.csiro.au/observers/ as the seed and set depth to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched and parsed successfully, and then lost. If you set depth to 2, it will get indexed. If you use http://www.atnf.csiro.au/observers/index.html as the seed, it will be fetched, parsed and indexed successfully even with depth set to 1.

Regards,

Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Thursday, 29 October 2015 7:23 AM
> To: [email protected]
> Subject: Re: Bug: redirected URLs lost on indexing stage?
>
> Hi Arkadi,
>
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied),
>
> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0.
>
> > fetches target content, parses and saves it, but loses on the indexing
> > stage.
>
> Can you give a concrete example?
> While testing NUTCH-2124, I've verified that redirect targets get indexed.
>
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >   return; // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
>
> This code snippet is correct. A redirect is pretty much the same as a link:
> the crawler follows it. Ok, there are many differences, but the central point:
> a link does not get indexed, only the link target. And that's the same for
> redirects. There are always at least 2 URLs:
> - the source of the redirect
> - and the target of the redirection
> Only the latter gets indexed, after it has been fetched and if it is not a
> redirect itself.
>
> The source has no parseText and parseData, and that's why it cannot be
> indexed.
>
> If the target does not make it into the index:
> - first, check whether it passes URL filters and is not changed by normalizers
> - was it successfully fetched and parsed?
> - not excluded by robots=noindex?
>
> You should check the CrawlDb and the segments for this URL.
>
> If you could provide a concrete example, I'm happy to have a detailed look
> at it.
>
> Cheers,
> Sebastian
>
> On 10/28/2015 08:57 AM, [email protected] wrote:
> > Hi,
> >
> > I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question
> > mark in the subject because I work with a Nutch modification called Arch (see
> > http://www.atnf.csiro.au/computing/software/arch/). This is why I am only
> > 99% sure that the same bug would occur in the original Nutch 1.9.
> >
> > In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> > applied), fetches target content, parses and saves it, but loses it on
> > the indexing stage. This happens because the db datum is being mapped
> > with the original URL as the key, but the fetch and parse data and
> > parse text are being mapped with the final URL in IndexerMapReduce.
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >   return; // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> >
> > I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
> > This is because I am not completely sure that this is a bug in Nutch (see above),
> > and also because what will work for Arch may not work for Nutch. They are
> > different in their use of the crawl db.
> >
> > Regards,
> >
> > Arkadi
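[Editor's note: the key mismatch described in the original report can be sketched in a few lines of simplified Java. This is a hypothetical illustration, not the actual IndexerMapReduce code; the class, method and value names are made up. It only shows why, when the CrawlDb datum is keyed by the source URL but the fetch/parse output is keyed by the redirect target, neither URL's group passes the four-way null check quoted in the thread.]

```java
import java.util.*;

// Hypothetical sketch of the key mismatch described in the thread above.
// Not the real IndexerMapReduce code -- simplified to string-valued groups.
public class RedirectKeyMismatch {

    // Values emitted by the indexing job, grouped by URL key
    // (as a MapReduce reducer would see them).
    static Map<String, Set<String>> groups = new HashMap<>();

    static void emit(String url, String value) {
        groups.computeIfAbsent(url, k -> new HashSet<>()).add(value);
    }

    // A URL is indexed only if its group contains all four parts,
    // mirroring the null check quoted in the thread.
    static boolean indexable(String url) {
        Set<String> g = groups.getOrDefault(url, Collections.emptySet());
        return g.containsAll(
            Arrays.asList("dbDatum", "fetchDatum", "parseText", "parseData"));
    }

    public static void main(String[] args) {
        String source = "http://www.atnf.csiro.au/observers/";
        String target = "http://www.atnf.csiro.au/observers/index.html";

        // CrawlDb datum keyed by the original (seed) URL ...
        emit(source, "dbDatum");
        // ... but fetch and parse output keyed by the redirect target.
        emit(target, "fetchDatum");
        emit(target, "parseText");
        emit(target, "parseData");

        // Neither key has a complete record, so neither URL is indexed.
        System.out.println(indexable(source)); // false
        System.out.println(indexable(target)); // false
    }
}
```

Under this (assumed) grouping, the fix would be to key all four records by the same URL, either the source or the final redirect target, before the reducer runs.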

