Bugs item #681251, was opened at 2003-02-05 21:14
Message generated for change (Comment added) made by guinsu
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=681251&group_id=59548
Category: fetcher
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Doug Cutting (cutting)
Assigned to: tom (parliaments)
Summary: redirect handling

Initial Comment:
Yesterday we spoke about various ways of handling redirects. The current de-facto approach is to treat the redirected URL as an alternate name for the page. Duplicate elimination should ensure that only one of the names is ever shown in search results. However, our duplicate elimination fails for pages that change frequently, e.g., through the inclusion of ads or changing headlines. Is there a better way to deal with redirects that does not overly complicate the system?

----------------------------------------------------------------------

Comment By: Tim Patton (guinsu)
Date: 2004-06-01 19:20

Logged In: YES user_id=77755

Would it make sense to record the URL as the LAST URL in a series of redirects? I.e., if there are many links to foo.com from bar.com/redir?id=1, bar.com/redir?id=2, etc., then the name and title saved should be from the last page, foo.com, not from bar.com with all its redirects. This seems like it would eliminate most duplicates. Plus, the page being indexed would be the actual page the user would get, not a page that sent them through one or more redirects. I believe this is how search engines such as Google work; otherwise they would be overwhelmed by LinkShare redirects and the like.

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2003-02-18 19:15

Logged In: YES user_id=21778

I thought of another way of dealing with redirects. The fetcher could treat a redirect as a page containing a single link, and not follow that link at fetch time, but rather just report it, so that its target is added to the database as another page. Then the indexer could just ignore these pages, since they don't have any content.
If the target changes, then we'll note that the next time we fetch. We don't have to worry about adding duplicates, since the source pages are never indexed, only targets with real content. The only downside is that if you search on terms that are only in the redirecting URL, not in the target URL or the content, then you won't find the page. The best way to fix this would be to get the list of URLs that redirected to a page added to that page's fetchlist, like its incoming anchor list.

----------------------------------------------------------------------

Comment By: tom (parliaments)
Date: 2003-02-06 08:18

Logged In: YES user_id=663315

Another possible concern is the extra bandwidth consumption. If a site has many inlinks that point to http://foo.com/path, but which all redirect to http://www.foo.com/path, we can wind up fetching large portions of the site repeatedly, once per inlink. This is something to keep an eye on as we try to scale; I have no idea how prevalent it is. If it is common, we may want to check against a table of known URLs before redirecting.

A similar issue is redirects to 404 pages. A fair number of sites seem to do this: one domain in a test set I used had multiple hosts all redirecting 404s to a common host, leading to a backlog of identical requests. This is something the fetcher can look for on a per-run basis, but it could also be handled by passing redirect directives "through the loop" (possibly keeping link table entries) or by keeping a table of known URLs.

----------------------------------------------------------------------

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
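The "redirect as a content-less page with a single link" scheme from the 2003-02-18 comment can be sketched as follows. This is a minimal illustration, not Nutch code: `Page`, `fetch`, `should_index`, and the `http_get` helper are all hypothetical names invented here.

```python
# Sketch: treat a redirect as a content-less page whose only outlink is
# the redirect target. The target is added to the database as another
# page; the indexer skips pages with no content, so redirecting URLs
# are never indexed and cannot create duplicates.
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    content: str = ""                    # empty for redirect placeholders
    outlinks: list = field(default_factory=list)

def fetch(url, http_get):
    """http_get(url) -> (status, location_or_body); hypothetical helper."""
    status, payload = http_get(url)
    if status in (301, 302):
        # Don't follow the redirect at fetch time; just report the target
        # as an outlink so it lands on a later fetchlist.
        return Page(url=url, outlinks=[payload])
    return Page(url=url, content=payload)

def should_index(page):
    # Redirect placeholders have no content, so the indexer ignores them.
    return bool(page.content)
```

If the target changes, the placeholder's single outlink changes on the next fetch, which matches the "we'll note that the next time we fetch" behavior described above.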
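The alternative discussed in the 2004-06-01 comment (record the LAST URL in a redirect series) combines naturally with the "table of known URLs" suggested for saving bandwidth. A rough sketch, again with a hypothetical `http_get(url) -> (status, location_or_body)` helper:

```python
# Sketch: follow a redirect chain to its final URL, caching every hop in
# a known-URL table so that repeated inlinks to the same redirector do
# not trigger repeated fetches of the same target.
def resolve_final_url(url, http_get, known=None, max_hops=5):
    known = known if known is not None else {}
    hops = []
    for _ in range(max_hops):
        if url in known:              # answered from the table, no fetch
            url = known[url]
            break
        status, payload = http_get(url)
        if status not in (301, 302):  # reached a real page
            break
        hops.append(url)
        url = payload
    for u in hops:                    # record each hop's final target
        known[u] = url
    return url
```

Indexing under the returned URL gives the user the actual page they would reach, and the table directly addresses the once-per-inlink refetching concern raised for http://foo.com/path vs. http://www.foo.com/path.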
