I am running Nutch 1.0 on Windows Server 2003, crawling a local SharePoint site
hosted under IIS that has been configured to require domain authentication.
Browsing to the top-level URL (shown in the log excerpt below) works both from
the server itself and from external machines, using IE7, Firefox 3.x, and
Google Chrome 2.x.

In my httpclient-auth.xml file I have the following:

<auth-configuration>
  <credentials username="EdgeSearch" password="SearchPassword">
    <default scheme="ntlm" realm="smb-edge-dev" />
  </credentials>
</auth-configuration>

Note that I have also tried leaving out the "realm" attribute, fully qualifying
the username as "smb-edge-dev\EdgeSearch", using the bare username without the
domain, and leaving out the "scheme" attribute. The results are the same in
every case.
Those are the *only* credentials specified in the httpclient-auth.xml file. I
do not have any other credentials for any other sites in the config. I'm only
crawling this one site.
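
To rule out a malformed file, the configuration above can be sanity-checked
with a short script. This is only a sketch: it confirms that the XML is
well-formed and lists what each credentials entry declares. It does not check
the file against whatever Nutch itself accepts. The embedded string is a copy
of the configuration shown above, and the function name summarize is made up
for illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the configuration from this post. In practice you would read the
# real httpclient-auth.xml from the Nutch conf directory instead.
AUTH_XML = """\
<auth-configuration>
  <credentials username="EdgeSearch" password="SearchPassword">
    <default scheme="ntlm" realm="smb-edge-dev" />
  </credentials>
</auth-configuration>
"""


def summarize(xml_text):
    """Return one (username, child tag, scheme, realm) tuple per child element."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "auth-configuration":
        raise ValueError("unexpected root element: " + root.tag)
    rows = []
    for cred in root.findall("credentials"):
        for scope in cred:  # child elements such as <default>
            rows.append((cred.get("username"), scope.tag,
                         scope.get("scheme"), scope.get("realm")))
    return rows


if __name__ == "__main__":
    for row in summarize(AUTH_XML):
        print(row)
```

A missing attribute shows up as None in the tuple, which makes it easy to
compare the variants I tried (no realm, no scheme, and so on).
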
I set the general log level to DEBUG to get more information from the log. The
lines from the log that are of interest to me are:

2009-06-16 11:01:42,487 DEBUG http.Http - fetching http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,487 DEBUG http.Http - fetched 1656 bytes from http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,539 DEBUG http.Http - 401 Authentication Required
2009-06-16 11:01:55,087 DEBUG crawl.Generator - -shouldFetch rejected 'http://smb-edge-dev:8082/default.aspx', fetchTime=1249056102539, curTime=1245168109986
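
The fetchTime and curTime values in the last log line look like epoch
milliseconds. A minimal sketch to decode them, using only the two numbers from
the log (the helper names are made up for illustration), shows that fetchTime
falls on 2009-07-31, roughly 45 days after curTime. My assumption, which the
log does not state, is that the Generator skips a page whose next scheduled
fetch time is still in the future.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Values copied verbatim from the "-shouldFetch rejected" log line.
FETCH_TIME_MS = 1249056102539
CUR_TIME_MS = 1245168109986


def from_millis(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def gap_days(fetch_ms, cur_ms):
    """Distance from cur_ms to fetch_ms, expressed in days."""
    return (fetch_ms - cur_ms) / 86_400_000


if __name__ == "__main__":
    print("fetchTime:", from_millis(FETCH_TIME_MS).isoformat())
    print("curTime:  ", from_millis(CUR_TIME_MS).isoformat())
    print("gap in days:", round(gap_days(FETCH_TIME_MS, CUR_TIME_MS), 3))
```

The times are printed in UTC, so they are offset from the local timestamps in
the log by the server's time zone.
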
As I read it (please correct me if I am wrong), Nutch is requesting the target
page as intended and, as expected, is receiving a 401 response asking for
authentication.
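
The log does not show which authentication schemes the 401 response actually
offers. A rough sketch for checking that from the crawl machine follows. It
sends one unauthenticated request and reports the scheme names found in the
WWW-Authenticate headers. The function names are made up for illustration, and
the parsing assumes one challenge per header line.

```python
import urllib.error
import urllib.request


def offered_schemes(header_values):
    """Return the lower-cased scheme name from each WWW-Authenticate value.

    Assumes one challenge per header value; a value that packs several
    comma-separated challenges would only yield its first scheme.
    """
    schemes = []
    for value in header_values:
        parts = value.split(None, 1)
        if parts:
            schemes.append(parts[0].lower())
    return schemes


def probe(url):
    """Request url without credentials; return (status code, offered schemes)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, []
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for a 401; the headers are still available.
        values = err.headers.get_all("WWW-Authenticate") or []
        return err.code, offered_schemes(values)
```

Calling probe("http://smb-edge-dev:8082/default.aspx") should return 401 plus
the list of schemes. If "ntlm" is not in that list, the scheme configured in
httpclient-auth.xml would have nothing to match against.
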
There is *nothing* in the log file that indicates that authentication has
failed. There are no ERROR level messages anywhere in the log. The log goes
straight from the 401 to the Generator rejecting the page, and I have no idea
why.
Suggestions are *more* than welcome.
rjsjr