On Fri, May 15, 2009 at 6:57 PM, Rochelle D'souza <[email protected]> wrote:
> hi Susam,
> very sorry for the mistake in the 1st code. I had put <default/> but omitted
> that line when i sent it across to u :(.
>
> for our intranet sites we do not require a proxy. hence i have now removed
> the proxy and ensured its default auth and did a crawl. have attached the
> log, still getting the same 401 :(

I have run out of ideas on what might be causing the problem.

2009-05-15 18:44:58,326 DEBUG httpclient.Http - Credentials - username: devadmin; set as default for realm: ; scheme:
2009-05-15 18:44:58,326 DEBUG httpclient.Http - Pre-configured credentials with scope - host: googly; port: 80; not found for url: http://googly/robots.txt
2009-05-15 18:44:58,842 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2009-05-15 18:44:59,888 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2009-05-15 18:45:00,810 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@googly:80
2009-05-15 18:45:00,841 DEBUG httpclient.Http - url: http://googly/robots.txt; status code: 401; bytes received: 1539; Content-Length: 1539
2009-05-15 18:45:00,856 DEBUG httpclient.Http - Pre-configured credentials with scope - host: googly; port: 80; found for url: http://googly/
2009-05-15 18:45:00,856 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2009-05-15 18:45:00,888 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@googly:80

This part of the log shows that the 'devadmin' credentials were picked up for authentication, but the server refused access and returned an HTTP 401 response. There is not much I can do to help here, since everything appears to be working correctly except that the server returns an HTTP 401 response.

A few other things you could check, though I do not think any of these should cause a problem:

1. Does the password have any special characters? If yes, could you try again with a simpler alphanumeric password?

2. Is http.agent.host set properly? This should be the host name or the IP address of the machine on which your crawler is running.

3. Does this configuration help?

<credentials username="devadmin" password="password">
  <authscope host="googly" port="80"/>
</credentials>

4. This one?

<credentials username="devadmin" password="password">
  <authscope host="googly" port="80" scheme="NTLM"/>
</credentials>

If nothing helps, maybe it is time to use a network sniffer such as Wireshark and analyze the HTTP packets to see whether the server or the client is making a mistake here. (There could be a human error too, so don't rule out that option.) It would be worthwhile to compare the traffic between the browser and the server with the traffic between Nutch and the server.

Regards,
Susam Pal
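
P.S. For point 2, here is a minimal sketch of how http.agent.host could be set in conf/nutch-site.xml. The value "crawlerhost" is only a placeholder assumed for illustration; replace it with the actual host name or IP address of the machine running the crawl.

    <property>
      <name>http.agent.host</name>
      <!-- placeholder value (an assumption): use your crawler machine's host name or IP -->
      <value>crawlerhost</value>
    </property>

As far as I understand, the protocol-httpclient plugin uses this value as the client host name when it builds the NTLM credentials, so a wrong or empty value might matter for NTLM even though the other settings look right.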
