Thanks Eelco.
I applied the patch that you suggested. It works perfectly when the 
redirect returns a new URL.
However, when the redirect does not return a new URL (no URL or the same 
URL), the status is not updated. Wouldn't it be better to set the status 
to STATUS_FETCH_GONE in that case? Something like below.

Fetcher.java:
...
if (newUrl != null && !newUrl.equals(url.toString())) { 
    output(url, datum, null, CrawlDatum.STATUS_FETCH_SUCCESS);
    url = new UTF8(newUrl);
    redirecting = true;
    redirectCount++;
    if (LOG.isDebugEnabled()) {
        LOG.debug(" - protocol redirect to " + url);
    }
} else {
    // Record the status regardless of the log level.
    output(url, datum, null, CrawlDatum.STATUS_FETCH_GONE);
    if (LOG.isDebugEnabled()) {
        LOG.debug(" - protocol redirect skipped: " +
            (newUrl != null ? "to same url" : "filtered"));
    }
}
...
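
For illustration, the decision this change is meant to make can be 
sketched in isolation as below. This is not the Fetcher code: the class, 
the enum and the method name are made up, and the enum only stands in 
for the CrawlDatum status constants.

```java
// Hypothetical, standalone sketch; none of these names exist in Nutch.
class RedirectStatusSketch {

    // Stand-ins for CrawlDatum.STATUS_FETCH_SUCCESS / STATUS_FETCH_GONE.
    enum Status { FETCH_SUCCESS, FETCH_GONE }

    // Decide which status to record for the original URL of a redirect.
    // newUrl is null when the redirect target was filtered out.
    static Status statusForRedirect(String url, String newUrl) {
        if (newUrl != null && !newUrl.equals(url)) {
            // A real redirect: the original URL is done, follow the new one.
            return Status.FETCH_SUCCESS;
        }
        // No usable target (filtered, or a redirect to the same URL).
        return Status.FETCH_GONE;
    }

    public static void main(String[] args) {
        String pageA = "http://example.com/pageA";
        System.out.println(statusForRedirect(pageA, "http://example.com/pageB"));
        System.out.println(statusForRedirect(pageA, pageA));
        System.out.println(statusForRedirect(pageA, null));
    }
}
```

The point is only that every branch yields a status, so the original URL 
never stays unfetched in the crawlDB.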

Mathijs        

Eelco Lempsink wrote:
> On 13-jan-2007, at 14:34, Mathijs Homminga wrote:
>> I'm using nutch 0.8.1 and I noticed the following.
>> When pageA redirects to pageB (HTTP 3xx), pageA remains unfetched in 
>> the crawlDB (pageB is fetched).
>>
>> Hence, pageA shows up in each generate/fetch/updatedb iteration.
>>
>> Is this a bug? I found a previous thread on this list which describes 
>> this issue too:
>> http://www.mail-archive.com/[email protected]/msg04599.html
>
> Yes.  See http://issues.apache.org/jira/browse/NUTCH-273
>
> --Regards,
>
> Eelco Lempsink
>

