Perhaps a point of clarification - I'm assuming that the
www.domain.com/?code.asp&redirect=444 actually sends a redirect header
to the new page. In that case (I don't know enough about protocols
personally to be sure) it seems that nutch would have to recognize that
it's being redirected and refetch at the new location. Am I correct?
And if so, wouldn't nutch then index and display the new, redirected page?
I'm using version 0.7, btw.
thanks,
Glenn
Dennis Kubes wrote:
Protocol-level redirects (ASP redirects), meaning the server sends a
3xx redirect response code, work correctly in Nutch 0.8 dev. Nutch
processes the redirect target as a completely new page. If you are
doing ASP forwards, I believe that the original page
(www.domain.com/?code.aspx&redirect=445454) would be the URL that
shows up in the search, because Nutch doesn't know what is going on
behind the scenes in the ASP code. It only knows the URL it requested
and the content it received.
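
To make the protocol-level case concrete, here is a rough Python sketch
of how a crawler could tell that a response is a 3xx redirect and work
out the new URL to fetch. This is not Nutch's code (Nutch is written in
Java); the function name redirect_target and the set of status codes
are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Status codes treated as protocol-level redirects in this sketch (an assumption).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, headers):
    """Return the absolute URL a 3xx response points to, or None.

    request_url: the URL that was fetched
    status:      the HTTP status code of the response
    headers:     a dict of response headers
    """
    if status not in REDIRECT_STATUSES:
        return None
    # Header names are case-insensitive, so normalise them before the lookup.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    if not location:
        return None
    # Location may be relative; resolve it against the URL that was requested.
    return urljoin(request_url, location)
```

With a server-side forward there is no 3xx status and no Location
header, so a check like this returns None and the crawler only ever
sees the original URL.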
As of right now in 0.8 dev, meta-level redirects (meta refresh tags)
don't work correctly. They did in 0.7, but I don't think that
functionality has been ported yet.
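
For comparison, a meta-level redirect lives in the page content rather
than in the HTTP response, so it has to be found by parsing the HTML.
Below is a minimal Python sketch of that idea using only the standard
library. Again, this is not how Nutch implements it; MetaRefreshFinder
and meta_refresh_target are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Records the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=http://example.com/".
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def meta_refresh_target(page_url, html):
    """Return the absolute URL of a meta refresh in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    # The refresh URL may be relative; resolve it against the page URL.
    return urljoin(page_url, finder.target)
```

A crawler that skips this step indexes the page under its original URL,
even though a browser would move on to the refresh target.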
Dennis
Insurance Squared Inc. wrote:
How are redirects listed in version 0.7? If the crawler finds a link
like:
www.domain.com/?code.aspx&redirect=445454
and that link redirects through to www.another-domain.com, which of
those two links will show up in nutch?
(I'm wondering if I can use nutch to crawl sites with a lot of
redirects, and still end up with the correct redirected domain in the
listings).