Any thoughts? I've heard that a bug in Apache HttpClient breaks Negotiate authentication, but even if that is fixed, I'm not clear on how to configure the httpclient-auth.xml file. Can someone point me in the right direction?
Thanks,
Eric

> -----Original Message-----
> From: Eric Haszlakiewicz [mailto:[email protected]]
> Sent: Tuesday, May 27, 2014 4:53 PM
> To: '[email protected]'
> Subject: using kerberos with nutch
>
> I was able to follow the Nutch tutorial and get the bin/crawl command
> working with sites that don't require authentication, including loading the
> results into a Solr installation. I also checked that I could query the Solr
> index and get back the expected information.
>
> However, I can't figure out how to get it to use Kerberos authentication to
> fetch urls.
> I'm using apache-nutch-1.8, which appears to have the necessary version of
> Apache HttpClient (httpclient-4.1.1.jar).
>
> Here's what I see:
>
> ./bin/nutch org.apache.nutch.parse.ParserChecker https://myhost.example.com
> fetching: https://myhost.example.com
> Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: https://myhost.example.com
>
> In logs/hadoop.log:
> 2014-05-27 20:35:53,866 INFO parse.ParserChecker - fetching: https://myhost.example.com
> 2014-05-27 20:35:54,071 ERROR protocol.RobotRulesParser - Agent we advertise (My Nutch Spider) not listed first in 'http.robots.agents' property!
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.proxy.host = null
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.proxy.port = 8080
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.timeout = 10000
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.content.limit = 65536
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.agent = My Nutch Spider/Nutch-1.8
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2014-05-27 20:35:54,071 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2014-05-27 20:35:54,651 WARN httpclient.HttpMethodDirector - Unable to respond to any of these challenges: {negotiate=Negotiate}
>
> I enabled protocol-httpclient in conf/nutch-default.xml. I expect I need to
> put something in conf/httpclient-auth.xml, but I can't figure out what. I
> found the http://wiki.apache.org/nutch/HttpAuthenticationSchemes page,
> but all the examples there seem to assume that credentials consist of a
> username and password, which is of course not the case with Kerberos.
> How do I tell Nutch to use Negotiate authentication?
>
> Thanks,
> Eric

