Joel Halbert wrote:
Hi All,
I think we've stumbled across a possible bug in the way Nutch (we are
using 1.0) handles redirects and resolves relative links when crawling.
I have set the crawler up so that it follows links with query strings,
such as <a href="bla?a=1&b=2" />
I have also set the crawler to follow redirects.
The problem I am seeing is as follows, using the actual urls:
1. Nutch hits site:
www.honestjons.com
2. Request is redirected to page:
http://www.honestjons.com/shop.php
3. Nutch extracts links from page, e.g:
<a class = 'fix_png' href='?pid=34531&CatID=124'>
4. Nutch then fetches this page... because of the redirect, the url that
is fetched should be:
http://www.honestjons.com/shop.php?pid=34531&CatID=124
However, the url that Nutch actually fetches is:
http://www.honestjons.com/?pid=34531&CatID=124
i.e. the change in path introduced by the redirect has not been applied
when resolving relative links found on the redirected page.
When previously building our own crawler we saw exactly the same issue.
The solution is simply to check for redirects and use the redirect
target as the current base when constructing new urls from relative
links on the current page.
If people are in agreement I will raise a bug ticket, or else please let
me know if there is something (configuration?) that we have missed.
You should create a bug ticket. This is an issue with relative link
resolution, and the bug actually exists in java.net.URL, so we need to
work around it.
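One possible shape for such a workaround, as a sketch only and not the
actual Nutch fix (the class name QueryLinkResolver and its resolve method
are hypothetical): treat a link that consists only of a query string as a
special case and keep the base path, which is what RFC 3986 specifies for
such references, and leave every other link to java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class QueryLinkResolver {

    /** Resolves a link against a base url, keeping the base path for "?query" links. */
    static String resolve(String base, String link) {
        try {
            URL baseUrl = new URL(base);
            String target = link.trim();
            if (target.startsWith("?")) {
                // Query-only link: keep protocol, host, port and path of the
                // base and replace only its query string.
                // Note: this sketch drops any user info present in the base.
                String path = baseUrl.getPath().isEmpty() ? "/" : baseUrl.getPath();
                return new URL(baseUrl.getProtocol(), baseUrl.getHost(),
                        baseUrl.getPort(), path + target).toString();
            }
            // Everything else goes through the normal java.net.URL resolution.
            return new URL(baseUrl, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + link + " against " + base, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.honestjons.com/shop.php",
                "?pid=34531&CatID=124"));
    }
}
```

With the urls from the report, this returns
http://www.honestjons.com/shop.php?pid=34531&CatID=124 regardless of how
the underlying java.net.URL handles query-only links.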
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com