I installed "Fiddler" as a proxy on the server and compared the sessions from IE and Nutch. When IE receives the 401, it sends a new request carrying the NTLM authentication tokens, for which it receives a 200. When Nutch receives the 401, it does not make another request.
This implies to me that the credentials that I've added to httpclient-auth.xml are being ignored. Is there something that I need to set in nutch-site.xml to enable authentication? Is there another configuration option that I've missed somewhere?

Many thanks!

rjsjr

-----Original Message-----
From: Robert Sanford [mailto:[email protected]]
Sent: Tuesday, June 16, 2009 11:27 AM
To: [email protected]
Subject: NTLM Authentication Not Occuring...

I have Nutch 1.0 running on Windows 2003 Server, hitting a local Sharepoint site running under IIS that has been configured to require domain authentication. Hitting the site with the top-level URL, shown in the logs below, works both on the machine it is running on and on external machines with IE7, Firefox 3.x, and Google Chrome 2.x.

In my httpclient-auth.xml file I have the following:

<auth-configuration>
  <credentials username="EdgeSearch" password="SearchPassword">
    <default scheme="ntlm" realm="smb-edge-dev" />
  </credentials>
</auth-configuration>

Note that I have tried leaving out the "realm" attribute, fully qualifying the username to "smb-edge-dev\EdgeSearch", leaving out the domain as part of the username, and leaving out the "scheme" attribute. The results are consistent.

Those are the *only* credentials specified in the httpclient-auth.xml file. I do not have any other credentials for any other sites in the config. I'm only crawling this one site.

I set the general log level to DEBUG to get more information from the log.
The lines from the log that are of interest to me are:

2009-06-16 11:01:42,487 DEBUG http.Http - fetching http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,487 DEBUG http.Http - fetched 1656 bytes from http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,539 DEBUG http.Http - 401 Authentication Required
2009-06-16 11:01:55,087 DEBUG crawl.Generator - -shouldFetch rejected 'http://smb-edge-dev:8082/default.aspx', fetchTime=1249056102539, curTime=1245168109986

What that is telling me, please correct me if I am wrong, is that Nutch is hitting the target site as requested and, as expected, is receiving a 401 requesting authentication.

There is *nothing* in the log file that indicates that authentication has failed. There are no ERROR level messages anywhere in the log. It goes from the 401 to rejecting the page and I have no idea why.

Suggestions are *more* than welcome.

rjsjr
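
As a side check on the "shouldFetch rejected" line, the two timestamps in it can be decoded. The following is only a minimal Python sketch, assuming that fetchTime and curTime are milliseconds since the Unix epoch (the log does not say so explicitly). The helper names from_epoch_ms and until_next_fetch are made up for illustration and are not part of Nutch.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the crawl.Generator log line quoted above.
# Assumption: both are milliseconds since the Unix epoch.
FETCH_TIME_MS = 1249056102539
CUR_TIME_MS = 1245168109986

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime (exact, no float rounding)."""
    return EPOCH + timedelta(milliseconds=ms)


def until_next_fetch(fetch_ms: int, cur_ms: int) -> timedelta:
    """How far in the future the scheduled fetch time lies relative to the current time."""
    return timedelta(milliseconds=fetch_ms - cur_ms)


if __name__ == "__main__":
    print("curTime   (UTC):", from_epoch_ms(CUR_TIME_MS).isoformat())
    print("fetchTime (UTC):", from_epoch_ms(FETCH_TIME_MS).isoformat())
    print("gap            :", until_next_fetch(FETCH_TIME_MS, CUR_TIME_MS))
```

Under that assumption the gap comes out a few seconds short of 45 days, and the minutes, seconds, and milliseconds of fetchTime (01:42.539) match the timestamp of the 401 line. So it looks as if the URL was rescheduled for 45 days after the 401 rather than retried with credentials, although that is my reading of the numbers and not something the log states.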
