I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I 
have set the http.proxy.host and http.proxy.port properties in my 
nutch-site.xml. The proxy does not require authorization. Nutch picks the 
settings up; I can see them in the log (see below). But I still get 
java.net.UnknownHostException.
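
For reference, a proxy setup of this kind in nutch-site.xml would look roughly 
like the sketch below. The host name and port are placeholders (assumptions for 
illustration), not the real values from my network.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder host: replace with the real proxy host name -->
  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <!-- Placeholder port: replace with the real proxy port -->
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>
</configuration>
```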

Interestingly, I used Wireshark (formerly Ethereal) to check whether Nutch 
really tries to use the proxy. There is a request from Nutch to the proxy for 
robots.txt, and the response is "404 Not Found". There is no following request 
for the actual page, only the one for robots.txt.

Any ideas what is wrong?
Marcin Okraszewski

2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.host = <my_proxy_host>
2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.port = <my_proxy_port>
2007-05-15 17:38:59,465 INFO  http.Http - http.timeout = 10000
2007-05-15 17:38:59,465 INFO  http.Http - http.content.limit = 65536
2007-05-15 17:38:59,465 INFO  http.Http - http.agent = YetAnotherSearchEngine/Nutch-0.9
2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.blocking = true
2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.robots = true
2007-05-15 17:38:59,466 INFO  http.Http - fetcher.server.delay = 100
2007-05-15 17:38:59,466 INFO  http.Http - http.max.delays = 100
2007-05-15 17:38:59,832 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
2007-05-15 17:38:59,832 ERROR http.Http - Caused by: java.net.UnknownHostException: www.gral.pl: www.gral.pl
2007-05-15 17:38:59,832 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1128)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName(InetAddress.java:1061)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getByName(InetAddress.java:958)
2007-05-15 17:38:59,833 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
2007-05-15 17:38:59,834 INFO  fetcher.Fetcher - fetch of <crawl_site> failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general