Hi,
I am having trouble crawling an intranet site with an (IMHO) odd redirect
mechanism. One part of the intranet website requires authentication,
which Nutch can bypass by sending a special http.agent.name. This works fine.
The issue I am facing is that, after successful authentication, the
server sends a redirect (302) to the same URL. Nutch does not follow the
redirect. My guess is that Nutch skips the URL because it has already
been visited.
Any pointers on how to overcome this and index the site after the
redirect has happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth
I am using Nutch 1.3 with the following settings:
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
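
To confirm what the server actually returns, the redirect described above could be reproduced outside Nutch. The following is only a minimal sketch in Python, not Nutch code and not a description of how Nutch handles redirects. The names `trace_redirects` and `http_fetch` are made up for illustration, and the cap of 10 simply mirrors the http.redirect.max value above. It sends the agent name as the User-Agent header, follows redirects by hand, and flags any hop whose target is the URL that was just requested.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following redirects so every hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def http_fetch(url, agent):
    """Fetch url once and return (status, Location header or None)."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def trace_redirects(url, agent, max_redirects=10, fetch=http_fetch):
    """Follow redirects by hand, up to max_redirects, recording every hop.

    Returns a list of (url, status, target, is_self_redirect) tuples.
    The fetch parameter is a hypothetical hook so the logic can be
    exercised without a network connection.
    """
    hops = []
    for _ in range(max_redirects + 1):
        status, location = fetch(url, agent)
        # Location may be relative, so resolve it against the current URL.
        target = urljoin(url, location) if location else None
        hops.append((url, status, target, target == url))
        if status not in (301, 302, 303, 307, 308) or target is None:
            break
        url = target  # deliberately no "already visited" check here
    return hops
```

Calling `trace_redirects("http://intranet.example/secure/", "my-nutch-1.3")` (the URL is a placeholder) and printing the returned list would show whether the 302 really points back at the same URL and whether a second request to it ends in a 200. If every hop is flagged as a self-redirect until the cap is reached, the server presumably expects some state from the first response that this sketch does not carry over.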