Marcin Okraszewski wrote:
>I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I
>have set up the http.proxy.host and http.proxy.port properties in my
>nutch-site.xml. The proxy does not require authorization. Nutch picks it up - I
>can see it in the log (see below). But I still get
>java.net.UnknownHostException.
>
>Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really
>tries to use the proxy. And there is a request from Nutch to the proxy to get
>robots.txt. It says "404 Not Found". There is no following request for the
>particular page, only for robots.txt.
>
>Any ideas what is wrong?
>

IIRC we had to patch Nutch in order to make it work with a proxy, but that
was Nutch 0.8 and I don't have this code available right now. You might
want to search JIRA for possible patches.

Actually, it seems like something has been done:

http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt

see issue 21

HTH

Michael

>Marcin Okraszewski
>
>2007-05-15 17:38:59,465 INFO http.Http - http.proxy.host = <my_proxy_host>
>2007-05-15 17:38:59,465 INFO http.Http - http.proxy.port = <my_proxy_port>
>2007-05-15 17:38:59,465 INFO http.Http - http.timeout = 10000
>2007-05-15 17:38:59,465 INFO http.Http - http.content.limit = 65536
>2007-05-15 17:38:59,465 INFO http.Http - http.agent =
>YetAnotherSearchEngine/Nutch-0.9
>2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.blocking = true
>2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.robots = true
>2007-05-15 17:38:59,466 INFO http.Http - fetcher.server.delay = 100
>2007-05-15 17:38:59,466 INFO http.Http - http.max.delays = 100
>2007-05-15 17:38:59,832 ERROR http.Http -
>org.apache.nutch.protocol.http.api.HttpException:
>java.net.UnknownHostException: <crawl_site>: <crawl_site>
>2007-05-15 17:38:59,832 ERROR http.Http - at
>org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
>2007-05-15 17:38:59,832 ERROR http.Http - at
>org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
>2007-05-15 17:38:59,832 ERROR http.Http - at
>org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
>2007-05-15 17:38:59,832 ERROR http.Http - Caused by:
>java.net.UnknownHostException: www.gral.pl: www.gral.pl
>2007-05-15 17:38:59,832 ERROR http.Http - at
>java.net.InetAddress.getAllByName0(InetAddress.java:1128)
>2007-05-15 17:38:59,833 ERROR http.Http - at
>java.net.InetAddress.getAllByName0(InetAddress.java:1098)
>2007-05-15 17:38:59,833 ERROR http.Http - at
>java.net.InetAddress.getAllByName(InetAddress.java:1061)
>2007-05-15 17:38:59,833 ERROR http.Http - at
>java.net.InetAddress.getByName(InetAddress.java:958)
>2007-05-15 17:38:59,833 ERROR http.Http - at
>org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
>2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
>2007-05-15 17:38:59,834 INFO fetcher.Fetcher - fetch of <crawl_site> failed
>with: org.apache.nutch.protocol.http.api.HttpException:
>java.net.UnknownHostException: <crawl_site>: <crawl_site>
>

-- 
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
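
A note on the quoted trace: the exception comes out of HttpBase.blockAddr, which calls java.net.InetAddress.getByName on the crawl host. That is a DNS lookup done on the crawling machine itself, so it may fail on a network where outside names can only be reached through the proxy, regardless of the http.proxy.* settings. The sketch below is not Nutch code; the class LocalDnsCheck and the method canResolveLocally are hypothetical names used only to reproduce that one lookup in isolation and see whether the machine can resolve a given host without the proxy.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-alone check, not part of Nutch.
public class LocalDnsCheck {

    // Performs the same kind of lookup that appears in the stack trace
    // (InetAddress.getByName) and reports whether it succeeded.
    public static boolean canResolveLocally(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            // Same exception type as in the quoted log.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the host to test as the first argument; "localhost" is only a default.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolvable locally: " + canResolveLocally(host));
    }
}
```

Running it with the crawl host as the argument on the crawling machine shows whether the host can be resolved there at all. If it cannot, the failure in the log would be consistent with a local lookup happening before the proxy is used, though that is an inference from the trace, not something confirmed against the Nutch source.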
