Hi All, I think we've stumbled across a possible bug in the way Nutch (using 1.0) handles redirects and fetches relative links when crawling.
I have set the crawler up so that it follows links with query strings, such as <a href="bla?a=1&b=2" />. I have also set the crawler to follow redirects.

The problem I am seeing is as follows, using the actual URLs:

1. Nutch hits the site: www.honestjons.com
2. The request is redirected to the page: http://www.honestjons.com/shop.php
3. Nutch extracts links from the page, e.g.: <a class = 'fix_png' href='?pid=34531&CatID=124'>
4. Nutch then fetches this page.

Because of the redirect, the URL that is fetched should be:

http://www.honestjons.com/shop.php?pid=34531&CatID=124

However, the URL that Nutch actually fetches is:

http://www.honestjons.com/?pid=34531&CatID=124

i.e. it has not applied the change in path caused by the redirect when resolving relative links found on the page.

When we previously built our own crawler we saw exactly the same issue. The solution is simply to check for redirects and use the redirect target as the current base when constructing new URLs from relative links on the current page.

If people are in agreement I will raise a bug ticket; otherwise please let me know if there is something (configuration?) that we have missed.

Thx,
Joel
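
To illustrate the fix described above, here is a minimal sketch in Python rather than Nutch's own Java. It is not Nutch code: the helper name resolve_link and its parameters are made up for this example, and it assumes the fetcher records the final URL after following redirects. It only shows how the choice of base URL changes the result for the link from step 3.

```python
from urllib.parse import urljoin


def resolve_link(requested_url, final_url, href):
    """Resolve href against the URL the page was actually served from.

    requested_url: the URL originally requested by the crawler.
    final_url: the URL after following redirects, or None if there was no redirect.
    href: the (possibly relative) link extracted from the page.
    """
    # Use the redirect target as the base when there was a redirect.
    base = final_url or requested_url
    return urljoin(base, href)


requested = "http://www.honestjons.com/"
redirected = "http://www.honestjons.com/shop.php"
href = "?pid=34531&CatID=124"

# Base is the redirect target: the path shop.php is kept.
print(resolve_link(requested, redirected, href))

# Base is the originally requested URL: the path from the redirect is lost.
print(resolve_link(requested, None, href))
```

The first call yields the shop.php URL that should be fetched; the second reproduces the URL that is actually being fetched.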
