Hello again,

Did you set the 'http.agent.host' in 'conf/nutch-site.xml' ?
I didn't have it set, but have now set it.
<property>
  <name>http.agent.host</name>
  <value>serverB.domain.com</value>
</property>  

#1 didn't work.

#2 ended up working. The user ID needs additional permissions, as we're now
seeing, but it's working nonetheless.
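
For reference, case #2 placed in a complete conf/httpclient-auth.xml would look
roughly like the sketch below. It assumes the <auth-configuration> wrapper shown
in the original message further down this thread. The host, port, and
credentials are the placeholder values used throughout the thread, not real ones.

```xml
<!-- Sketch only: case #2 from this thread inside the auth-configuration
     wrapper. All values are placeholders. -->
<auth-configuration>
  <credentials username="user" password="pass">
    <!-- No realm or scheme attribute is given here, so (as far as I
         understand) the scope is not restricted to one realm or scheme. -->
    <authscope host="server.domain.com" port="80"/>
  </credentials>
</auth-configuration>
```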

-----Original Message-----
From: Susam Pal [mailto:[email protected]] 
Sent: Tuesday, March 31, 2009 10:44 AM
To: [email protected]
Subject: Re: Nutch 1.0 - NTLM question

Hi Austin,

I read the logs and I went back to the code too
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?revision=749247&view=markup>.

However, I don't find anything unusual that could cause this
authentication problem. I just want to check another point even though
it is not very important. Did you set the 'http.agent.host' in
'conf/nutch-site.xml' ?

I would like to know the following:

1. Whether this works:

<credentials username="user" password="pass">
 <default/>
</credentials>

2. Whether this works:

 <credentials username="user" password="pass">
   <authscope host="server.domain.com" port="80"/>
 </credentials>

3. Whether this works:

 <credentials username="user" password="pass">
   <authscope host="server.domain.com" port="80" scheme="NTLM"/>
 </credentials>

If possible, please provide me the relevant logs for each of these three cases.

Regards,
Susam Pal

On Tue, Mar 31, 2009 at 9:44 PM, Austin, David <[email protected]> wrote:
> Hi Susam,
>
> Thanks for your quick response.  I've gone through the "Need Help" section.  
> Modified a few things accordingly.
>
> Turned on the debugging using:
> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>
> I had missed the following in nutch-site.xml. I've since added it, and I
> now see it trying to authenticate.
>  <property>
>  <name>plugin.includes</name>
>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  </property>
>
> In my logs, whether I have NTLM selected or not in httpclient-auth.xml, I
> see the following (note: I've tried domain\user and just user with the realm
> as the domain, and neither works):
>
> 2009-03-31 10:07:03,601 DEBUG httpclient.Http - Credentials - username: user; 
> set for AuthScope - host: server.domain.com; port: 80; realm: domain; scheme:
> 2009-03-31 10:07:03,648 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=8
> 2009-03-31 10:07:03,648 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=7
> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=6
> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=5
> 2009-03-31 10:07:03,664 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=4
> 2009-03-31 10:07:03,680 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=3
> 2009-03-31 10:07:03,680 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=2
> 2009-03-31 10:07:03,695 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=1
> 2009-03-31 10:07:03,758 INFO  auth.AuthChallengeProcessor - ntlm 
> authentication scheme selected
> 2009-03-31 10:07:04,070 INFO  httpclient.HttpMethodDirector - Failure 
> authenticating with NTLM <any realm>@server.domain.com:80
> 2009-03-31 10:07:04,070 DEBUG httpclient.Http - url: 
> http://server.domain.com/secured; status code: 401; bytes received: 24; 
> Content-Length: 24
> 2009-03-31 10:07:04,117 DEBUG httpclient.Http - 401 Authentication Required
>
> -----Original Message-----
> From: Susam Pal [mailto:[email protected]]
> Sent: Tuesday, March 31, 2009 9:58 AM
> To: [email protected]
> Subject: Re: Nutch 1.0 - NTLM question
>
> On Tue, Mar 31, 2009 at 9:03 PM, Austin, David <[email protected]> 
> wrote:
>> Got Nutch 1.0 set up fairly easily and even did a couple crawls. Very
>> pleased with the results so far. However, now I am trying to get the
>> NTLM portion to work.
>>
>> Following the instructions here:
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> My httpclient-auth.xml looks as follows:
>>
>> <auth-configuration>
>>  <credentials username="user" password="pass">
>>    <authscope host="server.domain.com" port="80" realm="domain.com"
>> scheme="NTLM"/>
>>  </credentials>
>> </auth-configuration>
>
> Hi David,
>
> For troubleshooting, I would suggest that you start with the simplest
> configuration for authentication. The simplest configuration contains
> only the default authentication scope.
>
> <credentials username="susam" password="masus">
>  <default/>
> </credentials>
>
> This is discussed in
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes in section
> "Crawling an Intranet with Default Authentication Scope". If this
> doesn't work fine, please go to "Need Help?" section in the same wiki
> article and follow the checklist and send us the relevant log files.
> If this goes fine, probably the authentication scope is not configured
> properly. You could ensure that the server indeed requires NTLM
> authentication and not Basic or Digest authentication. The realm value
> is another thing that could go wrong.
>
>>
>> Is this the correct setup for NTLM?  At present I'm only receiving 401's
>> so it doesn't appear to be working in this setup.  Basic auth would look
>> like "domain\user" if we were to login that way in case you're curious.
>
> I doubt that the value you have put in realm is correct. If you visit
> the page you are trying to crawl using a browser, what credentials do
> you enter? If you enter "domain\user" as the user name then only
> "domain" should go as the value of realm. However, before configuring
> authentication scope for NTLM scheme, I would suggest that you first
> get the default authentication scope working and then proceed with the
> configuration for NTLM authentication scheme.
>
>>
>> I noticed that for 0.9 there were properties that had to be set up in
>> nutch-site.xml; is that still the case?  Referring to this link:
>> http://www.mail-archive.com/[email protected]/msg02102.html
>> Here it looks like several http.auth.username and http.auth.password
>> have to be set.  Based on what I read though that's not needed anymore
>> in 1.0 based upon [NUTCH-559:
>> https://issues.apache.org/jira/browse/NUTCH-559], correct?
>
> Yes, you are right. http.auth.username and http.auth.password are not
> required. They were present during the development of this feature but
> they were removed as the development progressed.
>
> Regards,
> Susam Pal
>
