Hi Susam,

Many thanks for your reply.



As requested, I have configured only the default authentication. Below is
the httpclient-auth.xml:

<?xml version="1.0"?>

<auth-configuration>

      <credentials username="devadmin" password="password">

      </credentials>

</auth-configuration>

The logs for this configuration are attached.

I have masked only the agent and proxy host names, since I am sharing the
log file.



Then I changed httpclient-auth.xml to the following:

<?xml version="1.0"?>

<auth-configuration>

      <credentials username="devadmin" password="password">

            <authscope host="googly" port="80" realm="xyz"/>

      </credentials>

</auth-configuration>

Here I have masked the realm name as xyz. I have attached the log file for
this run as well. In the logs I have masked the agent, proxy host and realm
name.

In both of the above cases, I am unable to authenticate to the site.

Also, xyz\devadmin is the user name I enter when I log in to the site
http://googly using IE.
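
For comparison with the first file above: the earlier configuration quoted
below marks its fallback credentials with a <default/> child element, which
the first file here does not contain. A minimal sketch of that form, reusing
the same masked username and password, might look like the following. This is
only a sketch; whether the missing element explains the failure is an
assumption that the logs do not confirm.

```xml
<?xml version="1.0"?>
<auth-configuration>
  <!-- Assumed form, taken from the earlier quoted configuration:
       <default/> marks these credentials as the fallback when no
       authscope matches the host and port being fetched. -->
  <credentials username="devadmin" password="password">
    <default/>
  </credentials>
</auth-configuration>
```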




On Thu, May 14, 2009 at 7:32 PM, Susam Pal <[email protected]> wrote:

> On Thu, May 14, 2009 at 6:01 PM, Rochelle D'souza
> <[email protected]> wrote:
> > Hi Susam, I hope I am not troubling you by mailing you directly.
> >
> > It's just that I have not yet received a reply to my mail, and I am
> > desperately trying to resolve this issue I am facing.
>
> I did not receive your mail via the nutch-user mailing list. Have you
> subscribed to the list? I am not sure why I didn't receive your email
> that you posted to the list.
>
> > I also tried setting the properties below:
> >
> > <property>
> >   <name>http.agent.host</name>
> >   <value>pc0043XX.xyz.com</value>
> > </property>
> >
> >
> >
> > And
> >
> >
> >
> > <credentials username="devadmin" password="pass**">
> >    <authscope host="pc0043XX.xyz.com" port="80"/>
> >  </credentials>
> >
>
> I don't understand how this is helpful, since your site host name is
> 'googly' and not 'pc0043XX.xyz.com'.
>
> >>
> >> The complete code of httpclient-auth.xml is
> >>
> >> <auth-configuration>
> >>
> >>
> >>
> >>       <credentials username="132671" password="abc-1">
> >>
> >>             <default/>
> >>
> >>       </credentials>
> >>
> >>       <credentials username="devadmin" password="def-1">
> >>
> >>             <authscope host="10.230.35.135" port="8080" realm="xyz"
> >> scheme="NTLM"/>
> >>
> >>       </credentials>
> >>
> >>
> >> </auth-configuration>
> >>
> >>
> >>
> >> 132671 and devadmin are 2 user ids in the network having access to the
> >> site http://googly.
> >>
> >> The host ip is my machine ip on the LAN.
> >>
> >> Port is the port from which apache runs.
> >>
> >> The realm I understood to be my domain. Please let me know if this is
> >> correct.
> >>
> >> The scheme, I set it as NTLM because the site has IWA.
> >>
> >>
> >>
> >> The log extract is below:
> >>
> >> http.agent = POCSpider/Nutch-1.0
> >>
> >> protocol.plugin.check.blocking = false
> >>
> >> protocol.plugin.check.robots = false
> >>
> >> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>
> >> Credentials - username: 132671; set as default for realm: ; scheme:
> >>
> >> Credentials - username: devadmin; set for AuthScope - host: 10.230.35.135; port: 8080; realm: xyz; scheme: NTLM
> >>
> >> Pre-configured credentials with scope -  host: googly; port: 80; not found for url: http://googly/robots.txt
> >>
> >> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>
> >> url: http://googly/robots.txt; status code: 401; bytes received: 1539; Content-Length: 1539
> >>
> >> Pre-configured credentials with scope - host: googly; port: 80; found for url: http://googly/
>
> The above line confirms that credentials for 'googly' were picked up
> from one of the non-default authscopes.
>
> >>
> >> url: http://googly/; status code: 401; bytes received: 1539;
> >> Content-Length: 153
>
> However, the authentication does not succeed. The only thing I can
> imagine is that there is some problem at your end: either the website
> is not requesting NTLM authentication, or authentication is not
> properly configured at the server.
>
> The configuration file you have given doesn't help me understand
> where exactly you have configured the credentials for http://googly/.
> The port number for 10.230.35.135 is given as 8080 in the
> configuration file. However, you are trying to crawl http://googly/,
> which is running on port 80.
>
> But then, the logs tell us that default configuration is not being
> used. So, the information you have provided so far doesn't help me
> reach any conclusion.
>
> It would be great if you could delete your current log files. Make a
> very simple configuration with only default auth scope with some
> username and password configured that you know for sure can access
> http://googly/, perform a fresh crawl only for this site (so remove
> other URLs in the seed URLs file), and attach the complete
> 'httpclient-auth.xml' and log file in your mail.
>
> You might also want to go through the checklist in the "Need Help?"
> section of this wiki article:
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>
> Regards,
> Susam Pal
>
