I am running Nutch 1.0 on Windows 2003 Server, crawling a local SharePoint site 
running under IIS that has been configured to require domain authentication. 
Browsing to the site's top-level URL (shown in the logs below) works both on 
the machine it is running on and on external machines with IE7, Firefox 3.x, 
and Google Chrome 2.x.

In my httpclient-auth.xml file I have the following:
<auth-configuration>
    <credentials username="EdgeSearch" password="SearchPassword">
      <default scheme="ntlm" realm="smb-edge-dev" />
    </credentials>
</auth-configuration>

Note that I have tried leaving out the "realm" attribute, fully qualifying the 
username as "smb-edge-dev\EdgeSearch", using the bare username without the 
domain, and leaving out the "scheme" attribute. The results are the same every 
time.

Those are the *only* credentials specified in the httpclient-auth.xml file. I 
do not have any other credentials for any other sites in the config. I'm only 
crawling this one site.

I set the general log level to DEBUG to get more information from the log. The 
lines from the log that are of interest to me are:
2009-06-16 11:01:42,487 DEBUG http.Http - fetching http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,487 DEBUG http.Http - fetched 1656 bytes from http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,539 DEBUG http.Http - 401 Authentication Required
2009-06-16 11:01:55,087 DEBUG crawl.Generator - -shouldFetch rejected 'http://smb-edge-dev:8082/default.aspx', fetchTime=1249056102539, curTime=1245168109986

If I am reading that correctly (please correct me if I am wrong), Nutch is 
hitting the target site as requested and, as expected, is receiving a 401 
requesting authentication.

There is *nothing* in the log file that indicates that authentication has 
failed. There are no ERROR level messages anywhere in the log. It goes from the 
401 to rejecting the page and I have no idea why.
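
For what it is worth, the two numbers in the -shouldFetch line look like epoch 
timestamps in milliseconds. Below is a minimal Python sketch, assuming that 
interpretation, that decodes them and prints the gap between them. The helper 
name from_epoch_ms is just something I made up for illustration; it is not 
part of Nutch.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "-shouldFetch rejected" log line.
# Assumption: both are epoch timestamps in milliseconds.
FETCH_TIME_MS = 1249056102539
CUR_TIME_MS = 1245168109986

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


fetch_time = from_epoch_ms(FETCH_TIME_MS)
cur_time = from_epoch_ms(CUR_TIME_MS)
gap = fetch_time - cur_time

print("curTime  :", cur_time.isoformat())
print("fetchTime:", fetch_time.isoformat())
print("gap      :", gap)
```

If the millisecond assumption holds, curTime lands on 2009-06-16 and fetchTime 
on 2009-07-31 (both UTC), a little under 45 days apart. My guess is that the 
generator is skipping the URL because its next fetch is scheduled in the 
future, but that is only my reading of the numbers, and it does not explain 
why the fetch time was pushed out after the 401.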

Suggestions are *more* than welcome.

rjsjr
