Marcin Okraszewski wrote:

>I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I 
>have set the http.proxy.host and http.proxy.port properties in my 
>nutch-site.xml. The proxy does not require authorization. Nutch picks the 
>settings up - I can see them in the log (see below). But I still get 
>java.net.UnknownHostException.
>
>Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really 
>tries to use the proxy. There is a request from Nutch to the proxy for 
>robots.txt, and the response is "404 Not Found". There is no following 
>request for the page itself, only for robots.txt.
>
>Any ideas what is wrong?
>  
>

IIRC we had to patch Nutch to make it work with a proxy, but that was 
Nutch 0.8 and I don't have the code available right now. You might want 
to search JIRA for possible patches. Actually, it seems something has 
been done for 0.9:

http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt

See issue 21 there.

HTH

Michael

>Marcin Okraszewski
>
>2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.host = <my_proxy_host>
>2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.port = <my_proxy_port>
>2007-05-15 17:38:59,465 INFO  http.Http - http.timeout = 10000
>2007-05-15 17:38:59,465 INFO  http.Http - http.content.limit = 65536
>2007-05-15 17:38:59,465 INFO  http.Http - http.agent = 
>YetAnotherSearchEngine/Nutch-0.9
>2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.blocking = true
>2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.robots = true
>2007-05-15 17:38:59,466 INFO  http.Http - fetcher.server.delay = 100
>2007-05-15 17:38:59,466 INFO  http.Http - http.max.delays = 100
>2007-05-15 17:38:59,832 ERROR http.Http - 
>org.apache.nutch.protocol.http.api.HttpException: 
>java.net.UnknownHostException: <crawl_site>: <crawl_site>
>2007-05-15 17:38:59,832 ERROR http.Http - at 
>org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
>2007-05-15 17:38:59,832 ERROR http.Http - at 
>org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
>2007-05-15 17:38:59,832 ERROR http.Http - at 
>org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
>2007-05-15 17:38:59,832 ERROR http.Http - Caused by: 
>java.net.UnknownHostException: www.gral.pl: www.gral.pl
>2007-05-15 17:38:59,832 ERROR http.Http - at 
>java.net.InetAddress.getAllByName0(InetAddress.java:1128)
>2007-05-15 17:38:59,833 ERROR http.Http - at 
>java.net.InetAddress.getAllByName0(InetAddress.java:1098)
>2007-05-15 17:38:59,833 ERROR http.Http - at 
>java.net.InetAddress.getAllByName(InetAddress.java:1061)
>2007-05-15 17:38:59,833 ERROR http.Http - at 
>java.net.InetAddress.getByName(InetAddress.java:958)
>2007-05-15 17:38:59,833 ERROR http.Http - at 
>org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
>2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
>2007-05-15 17:38:59,834 INFO  fetcher.Fetcher - fetch of <crawl_site> failed 
>with: org.apache.nutch.protocol.http.api.HttpException: 
>java.net.UnknownHostException: <crawl_site>: <crawl_site>
>
>
>  
>
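
A note on the trace above: the exception is raised from HttpBase.blockAddr, which calls InetAddress.getByName on the crawl host. On a network where only the proxy can resolve external names, that local lookup would fail whatever the proxy settings are. The following is a minimal Java sketch of the idea of skipping the lookup when a proxy is configured. It is not the Nutch code or any actual patch, and the class ProxyAwareBlockKey and its methods are hypothetical names chosen for illustration.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: shows one way to avoid a local
// DNS lookup when requests are routed through an HTTP proxy.
final class ProxyAwareBlockKey {

    private ProxyAwareBlockKey() {}

    // True when a non-empty proxy host is configured
    // (for example the value of http.proxy.host).
    static boolean useProxy(String proxyHost) {
        return proxyHost != null && !proxyHost.trim().isEmpty();
    }

    // Returns the key used to group requests per server.
    // With a proxy, the lower-cased host name is used and no DNS lookup
    // happens; without one, the host is resolved to its IP address.
    static String blockKey(String host, String proxyHost) {
        if (useProxy(proxyHost)) {
            return host.toLowerCase(Locale.ROOT);
        }
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            // Surface the lookup failure without a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The proxy is set, so the unresolvable name is never looked up.
        System.out.println(blockKey("Example.INVALID", "proxy.example"));
    }
}
```

Keying on the host name when a proxy is in use means two names that point at the same address are treated as separate servers, which is the trade-off for not needing local DNS.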


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
