Hello everybody,
I am behind a proxy server, proxy.prod.anet.xxx.yy, which asks for my uid and
pwd each time Firefox is opened before I can visit any internet website.
There is also an intranet, and I have to provide my uid and pwd the first
time I connect to it.
I am building a crawler with Nutch 1.8 to crawl both internet and
intranet pages. Java 7 is used on the machine.
I am using parsechecker as the first step to test the Nutch settings.
Below are the parameter settings and the output of parsechecker. The log
level is set to DEBUG.
I followed the instructions in the wiki but am having difficulty getting
Nutch to work. Could you please help?
1. Crawl internet webpages without credential settings
Apart from http.agent.name and http.robots.agents, only the http.proxy.*
parameters are set in nutch-site.xml. Below are the contents of
nutch-site.xml, the parsechecker command, its output and hadoop.log.
*Nutch-site.xml*
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpider</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpider,*</value>
</property>
<property>
<name>http.proxy.host</name>
<value>proxy.prod.anet.xxx.yy</value>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
</property>
<property>
<name>http.proxy.username</name>
<value>myuid</value>
</property>
<property>
<name>http.proxy.password</name>
<value>mypwd</value>
</property>
</configuration>
*Command and output*
*bin/nutch parsechecker http://www.abc.net.au/*
fetching: http://www.abc.net.au/
http.proxy.host = proxy.prod.anet.xxx.yy
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = 65536
http.agent = NutchSpider/Nutch-1.8
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Fetch failed with protocol status: exception(16), lastModified=0: Http code=407, url=http://www.abc.net.au/
*Log*
2014-06-04 11:45:20,880 INFO parse.ParserChecker - fetching: http://www.abc.net.au/
2014-06-04 11:45:21,360 INFO http.Http - http.proxy.host = proxy.prod.anet.xxx.yy
2014-06-04 11:45:21,360 INFO http.Http - http.proxy.port = 8080
2014-06-04 11:45:21,361 INFO http.Http - http.timeout = 10000
2014-06-04 11:45:21,361 INFO http.Http - http.content.limit = 65536
2014-06-04 11:45:21,361 INFO http.Http - http.agent = NutchSpider/Nutch-1.8
2014-06-04 11:45:21,361 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2014-06-04 11:45:21,361 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
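The 407 above does not say which authentication schemes the proxy asks for. As a side check outside Nutch, something like the following Python sketch might show them. It is only a sketch: `offered_schemes` and `probe_proxy` are hypothetical helper names, not part of Nutch, and it assumes the proxy answers an unauthenticated request with a 407 response that carries `Proxy-Authenticate` headers.

```python
import socket


def offered_schemes(raw_response):
    """Return the scheme names found in Proxy-Authenticate headers.

    raw_response is the response text (status line plus headers) as a str.
    """
    head = raw_response.split("\r\n\r\n", 1)[0]
    schemes = []
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "proxy-authenticate":
            tokens = value.split()
            if tokens:
                # First token is the scheme, e.g. "NTLM" or "Basic"
                schemes.append(tokens[0])
    return schemes


def probe_proxy(proxy_host, proxy_port, url, timeout=10):
    """Send one unauthenticated GET through the proxy; return the reply head.

    Assumption: url is an absolute http:// URL such as http://www.abc.net.au/
    """
    host = url.split("/")[2]
    request = "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (url, host)
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as conn:
        conn.sendall(request.encode("ascii"))
        data = b""
        # Read only until the end of the headers
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("iso-8859-1")
```

With the values from this post, the call would look like `offered_schemes(probe_proxy("proxy.prod.anet.xxx.yy", 8080, "http://www.abc.net.au/"))`. I have not included any output because I cannot say in advance what the proxy returns.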
2. Crawl internet webpages with credential settings
Next, I added the http.auth.file property (together with protocol-httpclient in plugin.includes) and set up httpclient-auth.xml.
*Nutch-site.xml*
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpider</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpider,*</value>
</property>
<property>
<name>http.agent.host</name>
<value> xyz123</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr|index-more</value>
</property>
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
</property>
<property>
<name>http.proxy.host</name>
<value>proxy.prod.anet.xxx.yy</value>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
</property>
<property>
<name>http.proxy.username</name>
<value>myuid</value>
</property>
<property>
<name>http.proxy.password</name>
<value>mypwd</value>
</property>
<property>
<name>http.proxy.realm</name>
<value>anet</value>
</property>
</configuration>
*Httpclient-auth.xml*
<credentials username="myid" password="mypwd">
<authscope host="proxy.prod.anet.xxx.yy" port="8080" realm="anet"
scheme="NTLM"/>
</credentials>
<credentials username="myid" password="mypwd">
<default realm="anet"/>
</credentials>
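Because the file as pasted above has two top-level `<credentials>` elements, a quick well-formedness check may be worth running before blaming the proxy. The following Python sketch is one way to do that. `check_well_formed` and `list_credentials` are hypothetical helper names, and the `<auth-configuration>` wrapper mentioned in the note below is my assumption based on the httpclient-auth.xml template shipped with Nutch.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, root tag) if xml_text parses, else (False, parser message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, root.tag


def list_credentials(xml_text):
    """List (username, host, port, realm, scheme) for every authscope element."""
    root = ET.fromstring(xml_text)
    rows = []
    for cred in root.iter("credentials"):
        for scope in cred.findall("authscope"):
            rows.append((
                cred.get("username"),
                scope.get("host"),
                scope.get("port"),
                scope.get("realm"),
                scope.get("scheme"),
            ))
    return rows
```

Passing the text of httpclient-auth.xml to `check_well_formed` reports a parse error if the file has more than one root element. If the two `<credentials>` blocks are wrapped in a single `<auth-configuration>` element (assumption, see above), `list_credentials` prints the username and scope that would apply to the proxy, which also makes it easy to compare the username here with the one in nutch-site.xml.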
*Command and output*
*bin/nutch parsechecker http://www.abc.net.au/*
fetching: http://www.abc.net.au/
http.proxy.host = proxy.prod.anet.xxx.yy
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = 65536
http.agent = NutchSpider/Nutch-1.8
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
ntlm authentication scheme selected
Using authentication scheme: ntlm
Authorization challenge processed
Using authentication scheme: ntlm
Authorization challenge processed
Using authentication scheme: ntlm
Authorization challenge processed
Fetch failed with protocol status: exception(16), lastModified=0: Http code=407, url=http://www.abc.net.au/
*Log*
2014-06-04 14:34:50,162 INFO parse.ParserChecker - fetching: http://www.abc.net.au/
2014-06-04 14:34:50,438 INFO httpclient.Http - http.proxy.host = proxy.prod.anet.xxx.yy
2014-06-04 14:34:50,439 INFO httpclient.Http - http.proxy.port = 8080
2014-06-04 14:34:50,439 INFO httpclient.Http - http.timeout = 10000
2014-06-04 14:34:50,439 INFO httpclient.Http - http.content.limit = 65536
2014-06-04 14:34:50,439 INFO httpclient.Http - http.agent = NutchSpider/Nutch-1.8
2014-06-04 14:34:50,439 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2014-06-04 14:34:50,439 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2014-06-04 14:34:50,623 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2014-06-04 14:34:50,624 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2014-06-04 14:34:50,624 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-06-04 14:34:50,624 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-06-04 14:34:50,672 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-06-04 14:34:50,672 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-06-04 14:34:50,784 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-06-04 14:34:50,784 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-06-04 14:34:50,784 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@proxy.prod.anet.xxx.yy:8080
So far, the error “Fetch failed with protocol status: exception(16),
lastModified=0: Http code=407” occurs with both configurations.
Thanks in advance for your help.
Simon