Re: Not able to index url which is giving http 302

lewis john mcgibbney Thu, 15 Sep 2011 09:10:26 -0700

This looks pretty tricky. I am not experienced with using http-client in
general, and we could do with a wiki page documenting the redirect policies
and scenarios, as there is quite a bit of confusion within the community
about what some of the 'states' actually mean and how to crawl/index the
affected pages.

To address your problem specifically: as you said, your log output suggests
that basic authentication passes but that nothing is fetched due to the
redirect. How large is the site you are trying to crawl? Does your
http.content.limit property accommodate this?
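
For reference, a minimal sketch of how that limit could be raised in
conf/nutch-site.xml. This assumes the stock http.content.limit property
from nutch-default.xml, where a negative value disables truncation; the
value below is only illustrative and should be sized to your pages.

```xml
<!-- conf/nutch-site.xml: overrides the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: -1 means "do not truncate fetched content".
         A positive value is a byte limit per fetched document. -->
    <value>-1</value>
  </property>
</configuration>
```

If content is being cut off by the limit, re-fetching after this change
should show whether truncation was part of the problem; it will not by
itself change how the redirect is handled.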

Where are you getting the info on the 302 redirect (moved temporarily)? From
reading or dumping the crawldb stats? Surely there must be more information
available to narrow the problem area down here.

On Tue, Sep 13, 2011 at 10:41 AM, Anshuman Mor <[email protected]> wrote:

> Hi Lewis,
>
> My Fault, sorry for that..
>
> I had enabled some of the logging for httpclient. Please find attached log
> file.
>
> Please let me know if you need more information on this.
> http://lucene.472066.n3.nabble.com/file/n3332184/hadoop.log hadoop.log
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Not-able-to-index-url-which-is-giving-http-302-tp3329755p3332184.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*
