Re: Possible bug in when fetching relative links after a redirect - N 1.0

Andrzej Bialecki Wed, 29 Apr 2009 03:15:46 -0700

Joel Halbert wrote:

Hi All,


I think we've stumbled across a possible bug in the way Nutch (using
1.0) handles redirects and fetches relative links when crawling.


I have set the crawler up so that it follows links with query strings,
such as <a href="bla?a=1&b=2" />

I have also set the crawler to follow redirects.

The problem I am seeing is as follows, using the actual urls:

1. Nutch hits site:

www.honestjons.com

2. Request is redirected to page:

http://www.honestjons.com/shop.php



3. Nutch extracts links from page, e.g:
<a class = 'fix_png' href='?pid=34531&amp;CatID=124'>


4. Nutch then fetches this page... because of the redirect, the url that
is fetched should be:

http://www.honestjons.com/shop.php?pid=34531&CatID=124

However the url that Nutch actually fetches is:

http://www.honestjons.com/?pid=34531&CatID=124

i.e. it has not applied the change in path because of the redirect to
subsequent page requests.

When previously building our own crawler we saw exactly the same issue.
The solution is simply to check for redirects and apply the new path as
the current base when constructing new urls from relative links on the
current page.

If people are in agreement I will raise a bug ticket, or else please let
me know if there is something (configuration?) that we have missed.

You should create a bug ticket - this is an issue with relative links
resolution, and actually the bug exists in java.net.URL - we need to
work around this.
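
A minimal sketch of one possible workaround, assuming the page's final
(post-redirect) url is available as the base. The class and method names
(RelativeLinkResolver, resolve) are hypothetical and are not Nutch code.
The idea is to avoid handing a query-only link such as "?pid=..." to
java.net.URL directly, and instead prepend the base document's file name
first, so that the path from the redirect is preserved:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not part of Nutch.
class RelativeLinkResolver {

    /**
     * Resolves a link against the url the page was finally served from
     * (the redirect target, not the url originally requested).
     */
    static String resolve(String finalPageUrl, String link) {
        try {
            URL base = new URL(finalPageUrl);
            if (link.startsWith("?")) {
                // Query-only link: keep the last path segment of the base
                // (e.g. "shop.php") so java.net.URL does not drop it.
                String path = base.getPath();
                String fileName = path.substring(path.lastIndexOf('/') + 1);
                link = fileName + link;
            }
            return new URL(base, link).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve link: " + link, e);
        }
    }

    public static void main(String[] args) {
        // Base is the redirect target from step 2, link is from step 3.
        System.out.println(
            resolve("http://www.honestjons.com/shop.php", "?pid=34531&CatID=124"));
    }
}
```

With the urls from this thread, the sketch should produce
http://www.honestjons.com/shop.php?pid=34531&CatID=124 rather than the
url without shop.php. Links that are not query-only are passed to
java.net.URL unchanged.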



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
