Thanks a lot!! That was exactly it: the fetcher.threads.per.host.by.ip property. My network is isolated from Internet DNS, so the fetcher couldn't resolve the host name when it tried to group threads by IP. Setting the property to false resolved the problem. I hadn't thought of it.
Thanks a lot for the help.
Marcin

> I had the same issue.
>
> You need to use a tool like http://java-ntlm-proxy.sourceforge.net/ to
> bypass the proxy.
> You will have to edit its configuration file to add your proxy server
> hostname, port, login and password.
>
> Then you need to configure your Nutch process to point to this proxy. You
> should add the following to nutch-site.xml:
>
> <property>
>   <name>http.proxy.host</name>
>   <value>hostname of the machine where the NTLMProxy is located</value>
>   <description>The proxy hostname. If empty, no proxy is used.</description>
> </property>
>
> <property>
>   <name>http.proxy.port</name>
>   <value>port of the NTLMProxy process</value>
>   <description>The proxy port.</description>
> </property>
>
> I also suggest adding this property to avoid any hostname resolution
> conflicts:
>
> <property>
>   <name>fetcher.threads.per.host.by.ip</name>
>   <value>false</value>
>   <description>ssssssssss.</description>
> </property>
>
> Hope it will help you
>
> > I tried to run Nutch 0.9 from my network, which requires HTTP proxy access.
> > I have set up the http.proxy.host and http.proxy.port properties in my
> > nutch-site.xml. The proxy does not require authorization. Nutch picks it
> > up; I can see it in the log (see below). But I still get
> > java.net.UnknownHostException.
> >
> > Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really
> > tries to use the proxy. And there is a request from Nutch to the proxy to
> > get robots.txt. It says "404 Not Found". There is no following request for
> > the particular page, only for robots.txt.
> >
> > Any ideas what is wrong?
> > Marcin Okraszewski
> >
> > 2007-05-15 17:38:59,465 INFO http.Http - http.proxy.host = <my_proxy_host>
> > 2007-05-15 17:38:59,465 INFO http.Http - http.proxy.port = <my_proxy_port>
> > 2007-05-15 17:38:59,465 INFO http.Http - http.timeout = 10000
> > 2007-05-15 17:38:59,465 INFO http.Http - http.content.limit = 65536
> > 2007-05-15 17:38:59,465 INFO http.Http - http.agent = YetAnotherSearchEngine/Nutch-0.9
> > 2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.blocking = true
> > 2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.robots = true
> > 2007-05-15 17:38:59,466 INFO http.Http - fetcher.server.delay = 100
> > 2007-05-15 17:38:59,466 INFO http.Http - http.max.delays = 100
> > 2007-05-15 17:38:59,832 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>
> > 2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
> > 2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
> > 2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > 2007-05-15 17:38:59,832 ERROR http.Http - Caused by: java.net.UnknownHostException: www.gral.pl: www.gral.pl
> > 2007-05-15 17:38:59,832 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1128)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName(InetAddress.java:1061)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getByName(InetAddress.java:958)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
> > 2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
> > 2007-05-15 17:38:59,834 INFO fetcher.Fetcher - fetch of <crawl_site> failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>
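
For anyone finding this thread later: the stack trace shows InetAddress.getByName being called from HttpBase.blockAddr, which is where the local DNS lookup fails. The snippet below is only a rough sketch of that behaviour, not the actual Nutch source. The class HostKeySketch and the method queueKey are hypothetical names made up for illustration. With byIp set to true the key needs a local DNS lookup, which throws UnknownHostException on a network without DNS; with byIp set to false the host name is used as-is and no lookup happens.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;

public class HostKeySketch {

    // Hypothetical helper (not a Nutch API): returns the key under which
    // requests to the URL's host would be grouped.
    public static String queueKey(String url, boolean byIp) throws UnknownHostException {
        String host = URI.create(url).getHost();
        if (byIp) {
            // Needs local name resolution; fails when no DNS is reachable,
            // even if the HTTP request itself would go through a proxy.
            return InetAddress.getByName(host).getHostAddress();
        }
        // No lookup at all: the lower-cased host name is the key.
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://www.example.com/";
        try {
            System.out.println("by name: " + queueKey(url, false));
            System.out.println("by IP:   " + queueKey(url, true));
        } catch (UnknownHostException e) {
            System.out.println("by IP:   lookup failed: " + e);
        }
    }
}
```

Running it with a crawl URL as the only argument on the isolated network should show the by-name key succeeding and the by-IP lookup failing, which matches what setting fetcher.threads.per.host.by.ip to false works around.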
