On Thu, May 14, 2009 at 6:01 PM, Rochelle D'souza <[email protected]> wrote:
> hi Susam, i hope i am not troubling you by mailing you directly.
>
> Its just that i have not yet received a reply to my mail, and i desperately
> am trying to resolve this issue I am facing.

I did not receive your mail via the nutch-user mailing list. Have you subscribed to the list? I am not sure why I didn't receive your email that you posted to the list.

> i also tried setting the below properties
>
> <property>
>   <name>http.agent.host</name>
>   <value>pc0043XX.xyz.com</value>
> </property>
>
> And
>
> <credentials username="devadmin" password="pass**">
>   <authscope host="pc0043XX.xyz.com" port="80"/>
> </credentials>

I don't understand how this is helpful, since your site host name is 'googly' and not 'pc0043XX.xyz.com'.

>> The complete code of httpclient-auth.xml is
>>
>> <auth-configuration>
>>   <credentials username="132671" password="abc-1">
>>     <default/>
>>   </credentials>
>>   <credentials username="devadmin" password="def-1">
>>     <authscope host="10.230.35.135" port="8080" realm="xyz" scheme="NTLM"/>
>>   </credentials>
>> </auth-configuration>
>>
>> 132671 and devadmin are 2 user ids in the network having access to the
>> site http://googly.
>>
>> The host ip is my machine ip on the LAN.
>>
>> Port is the port from which apache runs.
>>
>> The realm I understood to be my domain. Please let me know if this is
>> correct.
>>
>> The scheme, I set it as NTLM because the site has IWA.
>>
>> The log extract is below:
>>
>> http.agent = POCSpider/Nutch-1.0
>> protocol.plugin.check.blocking = false
>> protocol.plugin.check.robots = false
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> Credentials - username: 132671; set as default for realm: ; scheme:
>> Credentials - username: devadmin; set for AuthScope - host: 10.230.35.135; port: 8080; realm: xyz; scheme: NTLM
>> Pre-configured credentials with scope - host: googly; port: 80; not found for url: http://googly/robots.txt
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> url: http://googly/robots.txt; status code: 401; bytes received: 1539; Content-Length: 1539
>> Pre-configured credentials with scope - host: googly; port: 80; found for url: http://googly/

The above line confirms that credentials for 'googly' were picked up from one of the non-default authscopes.

>> url: http://googly/; status code: 401; bytes received: 1539; Content-Length: 153

However, the authentication does not succeed. The only thing I can imagine is that there is some problem at your end. Either the website is not requesting NTLM authentication, or authentication is not properly configured at the server.

The configuration file you have given doesn't help me understand where exactly you have configured the credentials for http://googly/. The port number for 10.230.35.135 is given as 8080 in the configuration file. However, you are trying to crawl http://googly/, which is running on port 80. But then, the logs tell us that the default configuration is not being used. So, the information you have provided so far doesn't help me reach any conclusion.

It would be great if you could delete your current log files.
Make a very simple configuration with only the default auth scope, using a username and password that you know for sure can access http://googly/. Perform a fresh crawl only for this site (so remove the other URLs from the seed URLs file), and attach the complete 'httpclient-auth.xml' and log file to your mail.

You might also want to go through the checklist in the "Need Help?" section of this wiki article: http://wiki.apache.org/nutch/HttpAuthenticationSchemes

Regards,
Susam Pal
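
P.S. As a rough sketch of the simple configuration I am asking for, an 'httpclient-auth.xml' with only the default auth scope might look like the following. The element names are the ones already used in your own file. USERNAME and PASSWORD are placeholders, not real values, and I am assuming you replace them with an account that you have confirmed can open http://googly/.

```xml
<?xml version="1.0"?>
<!-- Minimal sketch: one credentials entry, default scope only. -->
<!-- USERNAME and PASSWORD are placeholders; use an account known to work. -->
<auth-configuration>
  <credentials username="USERNAME" password="PASSWORD">
    <default/>
  </credentials>
</auth-configuration>
```

With no <authscope> entries in the file, these should be the only credentials that can be offered to the server, so the resulting log ought to be much easier to interpret.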
