Hi,

I am having issues crawling an intranet site with an (IMHO) odd redirect mechanism. One part of the intranet website requires authentication, which Nutch can bypass by sending a special http.agent.name. This works fine.

The issue I am facing is that, after successful authentication, the server sends a redirect (302) to the same URL. Nutch does not follow this redirect. My guess is that Nutch skips the URL because it has already been visited.
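
To make the behaviour concrete, here is a minimal Java sketch of how such a self-redirect could be checked outside Nutch. It is only an illustration: the class name SelfRedirectCheck, the helper isSelfRedirect and the command-line arguments are hypothetical and not part of Nutch. It also sends only the bare agent name as the User-Agent header, whereas Nutch builds a longer string from several http.agent.* properties, so the header may need adjusting.

```java
import java.net.HttpURLConnection;
import java.net.URI;

public class SelfRedirectCheck {

    // True when a Location header (absolute or relative) resolves
    // to the same URL that was originally requested.
    public static boolean isSelfRedirect(String requestUrl, String location) {
        if (location == null) {
            return false;
        }
        URI requested = URI.create(requestUrl).normalize();
        URI target = requested.resolve(location).normalize();
        return requested.equals(target);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: SelfRedirectCheck <url> [user-agent]");
            return;
        }
        String url = args[0];
        // Assumption: the bare agent name is enough to pass authentication.
        String agent = args.length > 1 ? args[1] : "my-nutch-1.3";

        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // inspect the 302 instead of following it
        conn.setRequestProperty("User-Agent", agent);

        int status = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        System.out.println("status=" + status
                + " location=" + location
                + " selfRedirect=" + isSelfRedirect(url, location));
        conn.disconnect();
    }
}
```

Run against the protected URL, this should show whether the 302 really points back at the requested address, which is the case I suspect Nutch treats as already fetched.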

Any pointers on how to overcome this and index the site after the redirect has happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth


I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
