Hello everybody,


I am behind a proxy server, proxy.prod.anet.xxx.yy, which asks for my uid
and pwd each time I open Firefox before I can visit any internet website.

There is also an intranet, which asks for my uid and pwd the first time I
connect to it.



I am building a crawler with Nutch 1.8 to crawl both internet and
intranet pages. The machine runs Java 7.



I am using parsechecker as a first step to test the Nutch settings. The
parameter settings and the parsechecker output follow. The log level is
set to “DEBUG”.



I followed the instructions in the wiki but have had difficulty getting
Nutch to work. Could you please help?



1.      Crawl internet webpages without credential settings



Apart from http.agent.name and http.robots.agents, only the http.proxy.*
parameters are set in nutch-site.xml. The contents of nutch-site.xml, the
parsechecker command, its output and hadoop.log follow.



*nutch-site.xml*

<configuration>



<property>

  <name>http.agent.name</name>

  <value>NutchSpider</value>

</property>



<property>

  <name>http.robots.agents</name>

  <value>NutchSpider,*</value>

</property>



<property>

  <name>http.proxy.host</name>

  <value>proxy.prod.anet.xxx.yy</value>

</property>



<property>

  <name>http.proxy.port</name>

  <value>8080</value>

</property>



<property>

  <name>http.proxy.username</name>

  <value>myuid</value>

</property>



<property>

  <name>http.proxy.password</name>

  <value>mypwd</value>

</property>



</configuration>



*Command and output*



*bin/nutch parsechecker http://www.abc.net.au/*



fetching: http://www.abc.net.au/

http.proxy.host = proxy.prod.anet.xxx.yy

http.proxy.port = 8080

http.timeout = 10000

http.content.limit = 65536

http.agent = NutchSpider/Nutch-1.8

http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3

http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Fetch failed with protocol status: exception(16), lastModified=0: Http
code=407, url=http://www.abc.net.au/



*Log*

2014-06-04 11:45:20,880 INFO  parse.ParserChecker - fetching:
http://www.abc.net.au/

2014-06-04 11:45:21,360 INFO  http.Http - http.proxy.host =
proxy.prod.anet.xxx.yy

2014-06-04 11:45:21,360 INFO  http.Http - http.proxy.port = 8080

2014-06-04 11:45:21,361 INFO  http.Http - http.timeout = 10000

2014-06-04 11:45:21,361 INFO  http.Http - http.content.limit = 65536

2014-06-04 11:45:21,361 INFO  http.Http - http.agent = NutchSpider/Nutch-1.8

2014-06-04 11:45:21,361 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3

2014-06-04 11:45:21,361 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



2.      Crawl internet webpages with credential settings



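Next, I added the http.auth.file property and set up httpclient-auth.xml.
I also added protocol-httpclient to plugin.includes and set
http.agent.host and http.proxy.realm.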



*nutch-site.xml*

 <configuration>



<property>

  <name>http.agent.name</name>

  <value>NutchSpider</value>

</property>



<property>

  <name>http.robots.agents</name>

  <value>NutchSpider,*</value>

</property>



<property>

  <name>http.agent.host</name>

  <value> xyz123</value>

</property>



<property>

<name>plugin.includes</name>

<value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr|index-more</value>

</property>



<property>

  <name>http.auth.file</name>

  <value>httpclient-auth.xml</value>

</property>



<property>

  <name>http.proxy.host</name>

  <value>proxy.prod.anet.xxx.yy</value>

</property>



<property>

  <name>http.proxy.port</name>

  <value>8080</value>

</property>



<property>

  <name>http.proxy.username</name>

  <value>myuid</value>

</property>



<property>

  <name>http.proxy.password</name>

  <value>mypwd</value>

</property>



<property>

  <name>http.proxy.realm</name>

  <value>anet</value>

</property>



</configuration>



*httpclient-auth.xml*

 <credentials username="myid" password="mypwd">

    <authscope host="proxy.prod.anet.xxx.yy" port="8080" realm="anet"
scheme="NTLM"/>

</credentials>



<credentials username="myid" password="mypwd">

    <default realm="anet"/>

</credentials>
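
The two credential blocks above are fragments. A minimal sketch of how they would sit in a complete file follows. It assumes the auth-configuration root element used by the stock conf/httpclient-auth.xml that ships with Nutch, and it keeps the same placeholder values as above.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes the <auth-configuration> root element from the
     stock conf/httpclient-auth.xml; all values are placeholders. -->
<auth-configuration>

  <!-- Credentials scoped to the proxy host, port, realm and scheme -->
  <credentials username="myid" password="mypwd">
    <authscope host="proxy.prod.anet.xxx.yy" port="8080" realm="anet" scheme="NTLM"/>
  </credentials>

  <!-- Default credentials for any host and port in the "anet" realm -->
  <credentials username="myid" password="mypwd">
    <default realm="anet"/>
  </credentials>

</auth-configuration>
```
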

*Command and output*

*bin/nutch parsechecker http://www.abc.net.au/*

fetching: http://www.abc.net.au/

http.proxy.host = proxy.prod.anet.xxx.yy

http.proxy.port = 8080

http.timeout = 10000

http.content.limit = 65536

http.agent = NutchSpider/Nutch-1.8

http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3

http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Supported authentication schemes in the order of preference: [ntlm, digest,
basic]

ntlm authentication scheme selected

Using authentication scheme: ntlm

Authorization challenge processed

Using authentication scheme: ntlm

Authorization challenge processed

Using authentication scheme: ntlm

Authorization challenge processed

Fetch failed with protocol status: exception(16), lastModified=0: Http
code=407, url=http://www.abc.net.au/



*Log*

2014-06-04 14:34:50,162 INFO  parse.ParserChecker - fetching:
http://www.abc.net.au/

2014-06-04 14:34:50,438 INFO  httpclient.Http - http.proxy.host =
proxy.prod.anet.xxx.yy

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.proxy.port = 8080

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.timeout = 10000

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.content.limit = 65536

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.agent =
NutchSpider/Nutch-1.8

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3

2014-06-04 14:34:50,439 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

2014-06-04 14:34:50,623 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]

2014-06-04 14:34:50,624 INFO  auth.AuthChallengeProcessor - ntlm
authentication scheme selected

2014-06-04 14:34:50,624 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm

2014-06-04 14:34:50,624 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed

2014-06-04 14:34:50,672 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm

2014-06-04 14:34:50,672 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed

2014-06-04 14:34:50,784 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm

2014-06-04 14:34:50,784 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed

2014-06-04 14:34:50,784 INFO  httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@proxy.prod.anet.xxx.yy:8080



In both cases, the fetch fails with the same error: “Fetch failed with
protocol status: exception(16), lastModified=0: Http code=407”.




Thanks in advance for your help.



Simon
