Possible bug when fetching page-relative links after redirects - N 1.0.

Joel Halbert Thu, 30 Apr 2009 14:01:29 -0700

Hi All,

I think we've stumbled across a possible bug in the way Nutch (using
1.0) handles redirects and fetches relative links when crawling.



I have set the crawler up so that it follows links with query strings,
such as <a href="bla?a=1&b=2" /> 

I have also set the crawler to follow redirects.

The problem I am seeing is as follows, using the actual urls:

1. Nutch hits site:
www.honestjons.com 


2. Request is redirected to page: 
http://www.honestjons.com/shop.php


3. Nutch extracts links from the page, e.g.:
<a class = 'fix_png' href='?pid=34531&amp;CatID=124'>


4. Nutch then fetches this page. Because of the redirect, the url that
is fetched should be:
http://www.honestjons.com/shop.php?pid=34531&CatID=124
However, the url that Nutch actually fetches is:
http://www.honestjons.com/?pid=34531&CatID=124

i.e. it has not applied the change in path caused by the redirect when
resolving the relative links found on the page, so the links are still
resolved against the originally requested url.
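
To make the difference concrete, here is a small sketch outside of Nutch,
using Python's standard urljoin. It is only an illustration of standard
relative-link resolution, not Nutch code, and it assumes the original
request was for http://www.honestjons.com/ (the site from step 1 written
out as a full url):

```python
from html import unescape
from urllib.parse import urljoin

# href exactly as it appears in the page source (step 3)
raw_href = "?pid=34531&amp;CatID=124"
href = unescape(raw_href)  # "&amp;" becomes "&"

# Assumption: the originally requested url, written out in full
requested_url = "http://www.honestjons.com/"
# url the request was redirected to (step 2)
redirected_url = "http://www.honestjons.com/shop.php"

# Resolve the same relative link against each candidate base
resolved_against_requested = urljoin(requested_url, href)
resolved_against_redirected = urljoin(redirected_url, href)

print(resolved_against_requested)
print(resolved_against_redirected)
```

The first result is the url we see being fetched in step 4; the second
is the url we expected.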

When previously building our own crawler we saw exactly the same issue.
The solution is simply to check for redirects and, when one has
occurred, use the final redirected url as the base when constructing
new urls from relative links on the current page.
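
A rough sketch of that approach, in Python rather than Nutch's own code.
The names extract_links, LinkExtractor and redirect_chain are made up for
this example, and it assumes the redirect targets have already been
recorded as absolute urls:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects <a href> targets, resolved against a given base url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # HTMLParser has already unescaped entities such as &amp;
            if name == "href" and value is not None:
                self.links.append(urljoin(self.base_url, value))


def extract_links(requested_url, redirect_chain, html_text):
    """Resolve links against the last redirect target, if there was one.

    redirect_chain is a hypothetical list of the absolute urls the
    request was redirected to, in order; it is empty when no redirect
    happened.
    """
    base_url = redirect_chain[-1] if redirect_chain else requested_url
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


page = "<a class = 'fix_png' href='?pid=34531&amp;CatID=124'>"
links = extract_links(
    "http://www.honestjons.com/",
    ["http://www.honestjons.com/shop.php"],
    page,
)
print(links)
```

With the redirect recorded, the link resolves under shop.php; with an
empty redirect_chain the same call falls back to the requested url as
the base.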

If people are in agreement I will raise a bug ticket, or else please let
me know if there is something (configuration?) that we have missed.

Thx,

Joel
