Hi Kartik, I had a similar enquiry a long time ago and, from what I remember, Nutch will save the new URL and crawl it in a future cycle... which is not the desired behavior here.
To solve this problem, I customized my protocol-httpclient (the HttpResponse class) to just open the second URL right after the first one.

Crawling internal websites generally needs a lot of customization (authentication via POST requests, JavaScript redirection, NTLM authentication, ...). My general approach was to create "handlers" that are called from HttpResponse depending on the site being crawled. Maybe plugins could be used instead, but I thought they were a little overkill for the job.

I hope that helps!

Remi

On Tue, Dec 2, 2014 at 4:51 PM, Krishnanand, Kartik <[email protected]> wrote:

> Hi,
>
> I am crawling an internal site and there is a particular URL that I want
> to crawl. I hope that someone can help.
>
> When I load this URL in the browser, it does a 301 redirect to another URL
> that sets up cookies that do not expire until the end of the session. When
> I load the original URL again in the browser, it now loads successfully.
>
> I don't know how to simulate this in my crawler setup. I am aware of the
> "http.redirect.max" setting in our Nutch configuration XMLs. But if I
> understand it correctly, the crawler will follow the redirect and not come
> back to the original URL. Is my understanding correct?
>
> How would I be able to crawl this URL?
>
> Thanks,
>
> Kartik
>
> ----------------------------------------------------------------------
> This message, and any attachments, is for the intended recipient(s) only,
> may contain information that is privileged, confidential and/or proprietary
> and subject to important terms and conditions available at
> http://www.bankofamerica.com/emaildisclaimer. If you are not the
> intended recipient, please delete this message.
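For anyone curious what the per-site "handler" dispatch could look like, here is a minimal sketch in Java. All names here (SiteHandler, HandlerRegistry, REDIRECT_COOKIE, the example hostnames) are illustrative assumptions, not actual Nutch APIs — the real integration point would be your customized HttpResponse class:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-site handler registry, as described above.
// A customized protocol-httpclient HttpResponse would look up the handler
// for the host it just fetched and decide whether to refetch the URL.
public class HandlerRegistry {

    // A handler inspects the response status and says whether the original
    // URL should be fetched again (e.g. after a cookie-setting redirect).
    interface SiteHandler {
        boolean needsRefetch(int statusCode);
    }

    // Default behavior: never refetch in place (let Nutch do its usual thing).
    static final SiteHandler DEFAULT = status -> false;

    // For sites that 301-redirect once to set session cookies, request a
    // second fetch of the original URL after the redirect has been followed.
    static final SiteHandler REDIRECT_COOKIE = status -> status == 301;

    private final Map<String, SiteHandler> byHost = new HashMap<>();

    void register(String host, SiteHandler handler) {
        byHost.put(host, handler);
    }

    // Unregistered hosts fall back to the default handler.
    SiteHandler forHost(String host) {
        return byHost.getOrDefault(host, DEFAULT);
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register("intranet.example.com", REDIRECT_COOKIE);

        // The cookie-setting site asks for a refetch after a 301...
        System.out.println(registry.forHost("intranet.example.com").needsRefetch(301));
        // ...while any other host keeps the default (no refetch) behavior.
        System.out.println(registry.forHost("other.example.com").needsRefetch(301));
    }
}
```

The point of the registry is exactly what Remi describes: site-specific quirks (cookie redirects, NTLM, POST logins) stay isolated in small handlers instead of being hard-coded into one growing HttpResponse method.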

