[Nutch-dev] Re: dedup and redirect handling

Doug Cutting Mon, 18 Apr 2005 10:30:11 -0700

Luke Baker wrote:

When Nutch attempts to fetch a URL that replies with a redirect, Nutch will follow the redirect and download the page. However, that content is then credited to the original link and not to the URL that we actually downloaded the content from. Consider the example where we have the true URL (www.example.com) as one of our seed URLs. Later we crawl a URL that redirects to www.example.com (www.somewackysite.com/?redir=18903). The content gets associated with www.somewackysite.com/?redir=18903. When we run dedup, it finds duplicate content hashes and deletes the one for www.example.com because that was fetched prior to www.somewackysite.com/?redir=18903. The content for www.example.com is still available for searching, but the valuable anchor text for links to www.example.com is lost.

When multiple pages have the same content hash, the page with the highest score is kept. If the scores match, the page with the shorter URL is kept. The more recent page is preferred only when the URLs match.

Doug
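
The selection rule described in the reply above can be sketched roughly as follows. This is only an illustration in Python, not Nutch's actual dedup code: the Page record, its field names, the dedup function, and the two-pass structure are all assumptions made for the example, and the scores and fetch times in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one fetched page (not a Nutch class)."""
    url: str
    content_hash: str
    score: float
    fetch_time: int  # assumed: larger value means fetched more recently


def dedup(pages):
    """Return the pages that survive deduplication, sorted by URL."""
    # Pass 1: when the URLs match, keep only the most recent fetch.
    latest = {}
    for page in pages:
        current = latest.get(page.url)
        if current is None or page.fetch_time > current.fetch_time:
            latest[page.url] = page

    # Pass 2: when content hashes match, prefer the higher score,
    # and on equal scores prefer the shorter URL.
    best = {}
    for page in latest.values():
        current = best.get(page.content_hash)
        if current is None or (
            (page.score, -len(page.url)) > (current.score, -len(current.url))
        ):
            best[page.content_hash] = page

    return sorted(best.values(), key=lambda p: p.url)


# The example from the quoted message, with made-up scores and times.
pages = [
    Page("www.example.com", "abc123", 2.0, 100),
    Page("www.somewackysite.com/?redir=18903", "abc123", 1.0, 200),
]
for page in dedup(pages):
    print(page.url)
```

Under these assumed scores, the entry for www.example.com survives even though it was fetched first, because fetch time is only compared between entries with the same URL. If two pages tie on both score and URL length, this sketch keeps whichever it saw first; the reply does not say how such ties are resolved.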


