Crawling and redirects to the same URL

Elisabeth Adler Thu, 15 Sep 2011 12:25:59 -0700

Hi,

I am having issues crawling an intranet site with an (imho) odd redirect mechanism. One part of the intranet website requires authentication, which Nutch can bypass by sending a special http.agent.name. This works fine.

The issue I am facing is that, after successful authentication, the server sends a redirect (302) to the same URL. Nutch is not following the redirect. My guess is that Nutch skips the page because the URL has been visited before...

Any pointers on how to overcome this and index the site after the redirect has happened are very welcome. My configuration is below.

Thanks a lot,
Elisabeth


I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
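
For reference, here is roughly how those settings would look as overrides in conf/nutch-site.xml. This is only a sketch: the property names and values are the ones listed above, and the file name and the configuration/property/name/value layout are assumed to be the standard Nutch 1.x ones.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; values copied from the list above -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-nutch-1.3</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>-1</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>5</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1</value>
  </property>
  <property>
    <name>http.redirect.max</name>
    <value>10</value>
  </property>
</configuration>
```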
