Bugs item #681251, was opened at 2003-02-05 21:14
Message generated for change (Comment added) made by guinsu
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=681251&group_id=59548
Category: fetcher
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Doug Cutting (cutting)
Assigned to: tom (parliaments)
Summary: redirect handling

Initial Comment:
Yesterday we spoke about various ways of handling redirects. The current de-facto approach is to treat the redirected URL as an alternate name for the page. Duplicate elimination should ensure that only one of the names is ever shown in search results. However, our duplicate elimination fails for pages that change frequently, e.g., through the inclusion of ads or changing headlines. Is there a better way to deal with redirects that does not overly complicate the system?

----------------------------------------------------------------------

Comment By: Tim Patton (guinsu)
Date: 2004-06-01 19:20

Logged In: YES user_id=77755

Would it make sense to record the URL as the LAST URL in a series of redirects? I.e., if there are many links to foo.com from bar.com/redir?id=1, bar.com/redir?id=2, etc., then the name and title saved should be from the last page, foo.com, not from bar.com with all its redirects. This seems like it would eliminate most duplicates. Plus, the page being indexed would be the actual page the user would get, not a page that sent them through one or more redirects. I believe this is how search engines such as Google work; otherwise they would be overwhelmed by LinkShare redirects and the like.

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2003-02-18 19:15

Logged In: YES user_id=21778

I thought of another way of dealing with redirects. The fetcher could treat a redirect as a page containing a single link, and not follow that link at fetch time, but rather just report it, so that its target is added to the database as another page. Then the indexer could just ignore these pages, since they don't have any content.
If the target changes, then we'll note that the next time we fetch. We don't have to worry about adding duplicates, since the source pages are never indexed, only targets with real content. The only downside is that if you search on terms that are only in the redirecting URL, not in the target URL or the content, then you won't find the page. The best way to fix this would be to get the list of URLs that redirected to a page added to that page's fetchlist, like its incoming anchor list.

----------------------------------------------------------------------

Comment By: tom (parliaments)
Date: 2003-02-06 08:18

Logged In: YES user_id=663315

Another possible concern is the extra bandwidth consumption. If a site has many inlinks that point to http://foo.com/path, but which all redirect to http://www.foo.com/path, we can wind up fetching large portions of the site repeatedly, once per inlink. This is something to keep an eye on as we try to scale; I have no idea how prevalent it is. If it is common, we may want to check against a table of known URLs before redirecting.

A similar issue is redirects to 404 pages. A fair number of sites seem to do this: one domain in a test set I used had multiple hosts all redirecting 404s to a common host, leading to a backlog of identical requests. This is something the fetcher can look for on a per-run basis, but it could also be handled by passing redirect directives "through the loop" (possibly keeping link table entries) or by keeping a table of known URLs.

----------------------------------------------------------------------

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
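The "redirect as a content-less page with a single link" scheme from the 2003-02-18 comment can be sketched as follows. This is a minimal illustration, not Nutch code: `Page`, `fetch`, `should_index`, and the `http_get` helper are all hypothetical names invented here.

```python
# Sketch: treat a redirect as a content-less page whose only outlink is
# the redirect target. The target is added to the database as another
# page; the indexer skips pages with no content, so redirecting URLs
# are never indexed and cannot create duplicates.
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    content: str = ""                    # empty for redirect placeholders
    outlinks: list = field(default_factory=list)

def fetch(url, http_get):
    """http_get(url) -> (status, location_or_body); hypothetical helper."""
    status, payload = http_get(url)
    if status in (301, 302):
        # Don't follow the redirect at fetch time; just report the target
        # as an outlink so it lands on a later fetchlist.
        return Page(url=url, outlinks=[payload])
    return Page(url=url, content=payload)

def should_index(page):
    # Redirect placeholders have no content, so the indexer ignores them.
    return bool(page.content)
```

If the target changes, the placeholder's single outlink changes on the next fetch, which matches the "we'll note that the next time we fetch" behavior described above.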
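The alternative discussed in the 2004-06-01 comment (record the LAST URL in a redirect series) combines naturally with the "table of known URLs" suggested for saving bandwidth. A rough sketch, again with a hypothetical `http_get(url) -> (status, location_or_body)` helper:

```python
# Sketch: follow a redirect chain to its final URL, caching every hop in
# a known-URL table so that repeated inlinks to the same redirector do
# not trigger repeated fetches of the same target.
def resolve_final_url(url, http_get, known=None, max_hops=5):
    known = known if known is not None else {}
    hops = []
    for _ in range(max_hops):
        if url in known:              # answered from the table, no fetch
            url = known[url]
            break
        status, payload = http_get(url)
        if status not in (301, 302):  # reached a real page
            break
        hops.append(url)
        url = payload
    for u in hops:                    # record each hop's final target
        known[u] = url
    return url
```

Indexing under the returned URL gives the user the actual page they would reach, and the table directly addresses the once-per-inlink refetching concern raised for http://foo.com/path vs. http://www.foo.com/path.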
