I was able to follow the Nutch tutorial and get the bin/crawl command working 
with sites that don't require authentication, including loading the results 
into a Solr installation.  I also checked that I could query the Solr index and 
get back the expected information.

However, I can't figure out how to get Nutch to use Kerberos authentication 
when fetching URLs.
I'm using apache-nutch-1.8, which appears to have the necessary version of 
Apache HttpClient (httpclient-4.1.1.jar).

Here's what I see:

./bin/nutch org.apache.nutch.parse.ParserChecker https://myhost.example.com
fetching: https://myhost.example.com
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: https://myhost.example.com


In logs/hadoop.log:
2014-05-27 20:35:53,866 INFO  parse.ParserChecker - fetching: https://myhost.example.com
2014-05-27 20:35:54,071 ERROR protocol.RobotRulesParser - Agent we advertise (My Nutch Spider) not listed first in 'http.robots.agents' property!
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.proxy.host = null
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.proxy.port = 8080
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.timeout = 10000
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.content.limit = 65536
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.agent = My Nutch Spider/Nutch-1.8
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2014-05-27 20:35:54,651 WARN  httpclient.HttpMethodDirector - Unable to respond to any of these challenges: {negotiate=Negotiate}

I enabled protocol-httpclient in conf/nutch-default.xml.  I expect I need to 
put something in conf/httpclient-auth.xml, but I can't figure out what.  I 
found the http://wiki.apache.org/nutch/HttpAuthenticationSchemes page, but all 
the examples there seem to assume that credentials consist of a username and 
password, which is of course not the case with Kerberos.
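
For reference, the examples on that page are, as far as I can tell, roughly of 
this form. This is only a sketch: the username, password, realm, port, and 
scheme below are placeholders I made up for illustration, not values from my 
setup.

```xml
<?xml version="1.0"?>
<!-- Sketch of a username/password entry for conf/httpclient-auth.xml. -->
<!-- All attribute values are hypothetical placeholders. -->
<auth-configuration>
  <credentials username="someuser" password="somepassword">
    <!-- Scope in which these credentials would be offered -->
    <authscope host="myhost.example.com" port="443"
               realm="somerealm" scheme="basic"/>
  </credentials>
</auth-configuration>
```

I don't see an obvious place in this format for a Kerberos ticket or a 
Negotiate scheme.
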
How do I tell Nutch to use Negotiate authentication?

Thanks,
Eric
