Hello again,
> Did you set the 'http.agent.host' in 'conf/nutch-site.xml'?

I didn't have it set, but have now set it.

<property>
  <name>http.agent.host</name>
  <value>serverB.domain.com</value>
</property>

#1 didn't work. #2 ended up working. As we're seeing, the user ID still needs additional permissions, but it's working nonetheless.

-----Original Message-----
From: Susam Pal [mailto:[email protected]]
Sent: Tuesday, March 31, 2009 10:44 AM
To: [email protected]
Subject: Re: Nutch 1.0 - NTLM question

Hi Austin,

I read the logs and I went back to the code too
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?revision=749247&view=markup>.
However, I don't find anything unusual that could cause this authentication problem. I just want to check another point even though it is not very important. Did you set the 'http.agent.host' in 'conf/nutch-site.xml'?

I would like to know the following:

1. Whether this works:

<credentials username="user" password="pass">
  <default/>
</credentials>

2. Whether this works:

<credentials username="user" password="pass">
  <authscope host="server.domain.com" port="80"/>
</credentials>

3. Whether this works:

<credentials username="user" password="pass">
  <authscope host="server.domain.com" port="80" scheme="NTLM"/>
</credentials>

If possible, please provide me the relevant logs for each of these three cases.

Regards,
Susam Pal

On Tue, Mar 31, 2009 at 9:44 PM, Austin, David <[email protected]> wrote:
> Hi Susam,
>
> Thanks for your quick response. I've gone through the "Need Help" section.
> Modified a few things accordingly.
>
> Turned on the debugging using:
> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>
> I had missed the following in nutch-site.xml, so I've since added that so I
> now see it trying to authenticate.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> In my logs, whether I have NTLM selected or not in httpclient-auth.xml, I
> see the following (note: I've tried domain\user and just user with the realm
> as the domain and neither works):
>
> 2009-03-31 10:07:03,601 DEBUG httpclient.Http - Credentials - username: user; set for AuthScope - host: server.domain.com; port: 80; realm: domain; scheme:
> 2009-03-31 10:07:03,648 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
> 2009-03-31 10:07:03,648 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
> 2009-03-31 10:07:03,664 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
> 2009-03-31 10:07:03,664 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
> 2009-03-31 10:07:03,664 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
> 2009-03-31 10:07:03,680 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
> 2009-03-31 10:07:03,680 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> 2009-03-31 10:07:03,695 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2009-03-31 10:07:03,758 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> 2009-03-31 10:07:04,070 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@server.domain.com:80
> 2009-03-31 10:07:04,070 DEBUG httpclient.Http - url: http://server.domain.com/secured; status code: 401; bytes received: 24; Content-Length: 24
> 2009-03-31 10:07:04,117 DEBUG httpclient.Http - 401 Authentication Required
>
> -----Original Message-----
> From: Susam Pal [mailto:[email protected]]
> Sent: Tuesday, March 31, 2009 9:58 AM
> To: [email protected]
> Subject: Re: Nutch 1.0 - NTLM question
>
> On Tue, Mar 31, 2009 at 9:03 PM, Austin, David <[email protected]>
> wrote:
>> Got Nutch 1.0 set up fairly easily and even did a couple crawls. Very
>> pleased with the results so far. However, now I am trying to get the
>> NTLM portion to work.
>>
>> Following the instructions here:
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> My httpclient-auth.xml looks as follows:
>>
>> <auth-configuration>
>>   <credentials username="user" password="pass">
>>     <authscope host="server.domain.com" port="80" realm="domain.com"
>>                scheme="NTLM"/>
>>   </credentials>
>> </auth-configuration>
>
> Hi David,
>
> For troubleshooting, I would suggest that you start with the simplest
> configuration for authentication. The simplest configuration contains
> only the default authentication scope.
>
> <credentials username="susam" password="masus">
>   <default/>
> </credentials>
>
> This is discussed in
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes in section
> "Crawling an Intranet with Default Authentication Scope". If this
> doesn't work fine, please go to "Need Help?" section in the same wiki
> article and follow the checklist and send us the relevant log files.
> If this goes fine, probably the authentication scope is not configured
> properly. You could ensure that the server indeed requires NTLM
> authentication and not Basic or Digest authentication. The realm value
> is another thing that could go wrong.
>
>>
>> Is this the correct setup for NTLM? At present I'm only receiving 401's
>> so it doesn't appear to be working in this setup. Basic auth would look
>> like "domain\user" if we were to login that way in case you're curious.
>
> I doubt that the value you have put in realm is correct. If you visit
> the page you are trying to crawl using a browser, what credentials do
> you enter? If you enter "domain\user" as the user name then only
> "domain" should go as the value of realm. However, before configuring
> authentication scope for NTLM scheme, I would suggest that you first
> get the default authentication scope working and then proceed with the
> configuration for NTLM authentication scheme.
>
>>
>> I noticed that for 0.9 there were properties that had to be set up in
>> nutch-site.xml; is that still the case? Referring to this link:
>> http://www.mail-archive.com/[email protected]/msg02102.html
>> Here it looks like several http.auth.username and http.auth.password
>> have to be set. Based on what I read though that's not needed anymore
>> in 1.0 based upon [NUTCH-559:
>> https://issues.apache.org/jira/browse/NUTCH-559], correct?
>
> Yes, you are right. http.auth.username and http.auth.password are not
> required. They were present during the development of this feature but
> they were removed as the development progressed.
>
> Regards,
> Susam Pal
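
For anyone landing on this thread later, here is a rough sketch of the combination that the most recent reply reports as working: case 2 (an authscope with host and port only) together with 'http.agent.host' and the 'protocol-httpclient' plugin enabled. The host names, user name and password are the placeholders used in the thread, not real values. The 'conf/httpclient-auth.xml' location is an assumption based on the wiki article cited in the thread; only 'conf/nutch-site.xml' is named explicitly here.

```xml
<!-- conf/nutch-site.xml (fragment): properties mentioned in the thread -->
<property>
  <name>http.agent.host</name>
  <!-- placeholder from the thread: the host the crawler runs on -->
  <value>serverB.domain.com</value>
</property>
<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient must be listed for authentication to be attempted -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- conf/httpclient-auth.xml (assumed location): case 2 from the thread -->
<auth-configuration>
  <credentials username="user" password="pass">
    <!-- no realm or scheme attribute; placeholder host and port -->
    <authscope host="server.domain.com" port="80"/>
  </credentials>
</auth-configuration>
```

This only mirrors what was reported in the thread. As noted in the reply, the account may still need additional permissions on the server before every page can be fetched.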
