Let me check that I understand this. First, the protocol redirect (3xx): Nutch requests the new page and indexes it under the new URL. If so, I believe there are cases where that is less desirable and a more nuanced treatment is needed. Please see this recent SE Watch article on the topic: http://blog.searchenginewatch.com/blog/050801-130330
It references Yahoo's redirect policy for Slurp: http://help.yahoo.com/help/us/ysearch/slurp/slurp-11.html

Yahoo treats a meta-refresh (which it calls a "forward") differently depending on the delay: like a 302 when the delay is short, like a 301 when it is long. Nutch's policy differs from Yahoo's in that Nutch treats all 3xx responses and meta-refreshes identically, always indexing the content under the target URL. Yahoo is more selective: in several cases it indexes the target content under the source URL (like a pointer), namely on-site 302s and short, on-site meta-refreshes.

We might want to consider modifying Nutch to follow the conventions being set by Yahoo (and, to some extent, Google). The rationale is that most people prefer, and link to, shorter (homepage) URLs rather than longer ones (try the IEEE homepage).

Thoughts?

- Jeff

> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Saturday, April 15, 2006 3:58 PM
> To: [email protected]
> Subject: Re: redirect treatment
>
> There are three kinds of "redirects". The first is where the
> server forwards to a different page behind the scenes and
> returns its output. This is usually called a forward. The second
> is where the server sends a redirect code (usually in the 300
> range) and the browser then requests the page it was redirected
> to. This is usually called a protocol redirect, or just a
> redirect in JSP and ASP terms. The third is where the page has a
> meta-refresh tag in its HTML head. This is known as a content
> redirect or a meta redirect. Here the client doesn't get a
> redirect code in the response headers but, after a certain amount
> of time, requests the page given in the url part of the
> meta-refresh tag.
>
> If (www.domain.com/?code.asp&redirect=444) sends a forward
> then Nutch doesn't know anything about it and will just index
> the content returned under the original url.
> If it sends a protocol redirect, then Nutch requests the new page
> and indexes it under the new url. Nutch will follow redirects up
> to http.redirect.max times, so if the redirect page redirects
> again, Nutch will follow that one as well, up to the maximum. If
> the url variable "redirect" is used to populate a meta-refresh
> tag then, as of right now, Nutch won't follow the redirect.
> I think it fails with a NullPointer right now.
>
> The meta-refresh was working in 0.7.2 but is broken in 0.8.
> Andrzej Bialecki said he was looking into fixing it. Hope
> this helps you understand what is happening with the fetch.
>
> Dennis
>
> Insurance Squared Inc. wrote:
> > Perhaps a point of clarification - I'm assuming that
> > www.domain.com/?code.asp&redirect=444 actually sends a redirect
> > header to the new page. In that case (I don't know enough about
> > protocols personally to be sure) it seems that nutch would have to
> > recognize that it's being redirected and refetch at the new
> > location. Am I correct? And if so, wouldn't nutch then index and
> > display the new, redirected page?
> > I'm using version 0.7 btw.
> >
> > thanks,
> > Glenn
> >
> > Dennis Kubes wrote:
> >
> >> Protocol level redirects (asp redirects), meaning the server
> >> sends a 3xx redirect response code, work correctly in Nutch 0.8
> >> dev. It processes the target as a completely new page. If you are
> >> doing asp forwards I believe that the original page
> >> (www.domain.com/?code.aspx&redirect=445454) would be the URL that
> >> shows up in the search because Nutch doesn't know what is going on
> >> behind the scenes in the ASP code. It knows only the url and the
> >> content received. As of right now in 0.8 dev meta level redirects
> >> (meta refresh tags) don't work correctly. They did in 0.7 but I
> >> don't think that functionality has been ported.
> >>
> >> Dennis
> >>
> >> Insurance Squared Inc. wrote:
> >>
> >>> How are redirects listed in version 0.7?
> >>> If the crawler finds a link like:
> >>> www.domain.com/?code.aspx&redirect=445454
> >>> and that link redirects through to www.another-domain.com, which
> >>> of those two links will show up in nutch?
> >>>
> >>> (I'm wondering if I can use nutch to crawl sites with a lot of
> >>> redirects, and still end up with the correct redirected domain in
> >>> the listings).
> >>>
> >>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
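
To make Dennis's description of protocol redirects concrete, here is a minimal Python sketch of following 3xx responses up to a fixed limit, in the spirit of http.redirect.max. It is not Nutch's code. The `fetch` callable, the function name `follow_redirects`, and the default limit of 3 are all assumptions made for illustration.

```python
from urllib.parse import urljoin


def follow_redirects(fetch, url, max_redirects=3):
    """Follow 3xx responses at most max_redirects times.

    `fetch` is a hypothetical callable: url -> (status, location, content).
    Returns (final_url, content), or None when the limit is exceeded.
    """
    # One initial fetch plus up to max_redirects follow-up fetches.
    for _ in range(max_redirects + 1):
        status, location, content = fetch(url)
        if 300 <= status < 400 and location:
            # Resolve a relative Location value against the current URL.
            url = urljoin(url, location)
            continue
        # Not a redirect: the content belongs to whatever URL we ended on.
        return url, content
    return None
```

In this sketch the content is always attributed to the final URL in the chain, which matches the behaviour described in the thread for protocol redirects. A server-side forward never reaches this code path, because the client only ever sees one URL and one response.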

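
The Yahoo-style convention from Jeff's reply can be sketched as a small decision function. This is only an illustration of the rule as summarised in the thread, not Yahoo's or Nutch's implementation. The names `index_url` and `same_site`, the hostname-equality test for "on-site", and the `SHORT_DELAY_SECONDS` cutoff are all hypothetical; the thread gives no number for what counts as a short delay.

```python
from urllib.parse import urlsplit

# Hypothetical cutoff: the thread does not say what counts as a "short" delay.
SHORT_DELAY_SECONDS = 1


def same_site(source, target):
    """Treat two URLs as on-site when their hostnames match (an assumption)."""
    return urlsplit(source).hostname == urlsplit(target).hostname


def index_url(source, target, kind, delay=0):
    """Return the URL under which the target content would be indexed.

    kind is "301", "302", or "meta" (a meta-refresh with `delay` seconds).
    """
    if kind == "meta":
        # As summarised in the thread: short delay acts like a 302, long like a 301.
        kind = "302" if delay <= SHORT_DELAY_SECONDS else "301"
    if kind == "302" and same_site(source, target):
        return source  # keep the shorter source URL as a pointer
    return target  # what the thread says Nutch does for every redirect
```

For example, `index_url("http://www.example.org/", "http://www.example.org/portal/index.jsp", "302")` returns the homepage URL, while the same call with `"301"` returns the longer target URL.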