Thanks a lot!! That was exactly it - the fetcher.threads.per.host.by.ip
property. Since my network is isolated from Internet DNS, the fetcher couldn't
resolve the hostname in order to group by IP. Setting the property to false
resolved the problem. I hadn't thought of it.
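
In case it helps anyone else, this is roughly the snippet I ended up with in my
nutch-site.xml (same property and value as in your suggestion below; the
description text is just my own wording):

<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
  <description>Count fetcher threads per hostname instead of per IP, so no
  DNS lookup is needed before the request is handed to the proxy.
  </description>
</property>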

Thanks a lot for your help.
Marcin 

 
> I had the same issue.
> 
> You need to use a tool like http://java-ntlm-proxy.sourceforge.net/ to
> bypass the proxy.
> You will have to edit its configuration file to add your proxy server
> hostname, port, login, and password.
> 
> Then you need to configure your Nutch process to point to this proxy. You
> should add the following to nutch-site.xml:
> <property>
>   <name>http.proxy.host</name>
>   <value>hostname of the machine where the NTLMProxy is located</value>
>   <description>The proxy hostname.  If empty, no proxy is
> used.</description>
> </property>
> 
> <property>
>   <name>http.proxy.port</name>
>   <value>port of the NTLMProxy process</value>
>   <description>The proxy port.</description>
> </property>
> 
> I also suggest adding this property to avoid any hostname resolution
> conflicts:
>  <property>
>    <name>fetcher.threads.per.host.by.ip</name>
>    <value>false</value>
>    <description>If true, fetcher threads are counted per IP address
>    instead of per hostname.</description>
> </property>
> 
> Hope this helps you.
> 
> >
> > I tried to run Nutch 0.9 from my network, which requires HTTP proxy access.
> > I have set up the http.proxy.host and http.proxy.port properties in my
> > nutch-site.xml. The proxy does not require authorization. Nutch picks it up
> > - I can see it in the log (see below). But I still get
> > java.net.UnknownHostException.
> >
> > Interestingly, I used Wireshark (or Ethereal) to sniff whether Nutch really
> > tries to use the proxy. There is a request from Nutch to the proxy for
> > robots.txt, which returns "404 Not Found". There is no following request for
> > the particular page, only for robots.txt.
> >
> > Any ideas what is wrong?
> > Marcin Okraszewski
> >
> > 2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.host = <my_proxy_host>
> > 2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.port =
> > <my_proxy_port>
> > 2007-05-15 17:38:59,465 INFO  http.Http - http.timeout = 10000
> > 2007-05-15 17:38:59,465 INFO  http.Http - http.content.limit = 65536
> > 2007-05-15 17:38:59,465 INFO  http.Http - http.agent =
> > YetAnotherSearchEngine/Nutch-0.9
> > 2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.blocking =
> > true
> > 2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.robots =
> > true
> > 2007-05-15 17:38:59,466 INFO  http.Http - fetcher.server.delay = 100
> > 2007-05-15 17:38:59,466 INFO  http.Http - http.max.delays = 100
> > 2007-05-15 17:38:59,832 ERROR http.Http -
> > org.apache.nutch.protocol.http.api.HttpException:
> > java.net.UnknownHostException: <crawl_site>: <crawl_site>
> > 2007-05-15 17:38:59,832 ERROR http.Http - at
> > org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
> > 2007-05-15 17:38:59,832 ERROR http.Http - at
> > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
> > 2007-05-15 17:38:59,832 ERROR http.Http - at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > 2007-05-15 17:38:59,832 ERROR http.Http - Caused by:
> > java.net.UnknownHostException: www.gral.pl: www.gral.pl
> > 2007-05-15 17:38:59,832 ERROR http.Http - at
> > java.net.InetAddress.getAllByName0(InetAddress.java:1128)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at
> > java.net.InetAddress.getAllByName0(InetAddress.java:1098)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at
> > java.net.InetAddress.getAllByName(InetAddress.java:1061)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at
> > java.net.InetAddress.getByName(InetAddress.java:958)
> > 2007-05-15 17:38:59,833 ERROR http.Http - at
> > org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
> > 2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
> > 2007-05-15 17:38:59,834 INFO  fetcher.Fetcher - fetch of <crawl_site>
> > failed with: org.apache.nutch.protocol.http.api.HttpException:
> > java.net.UnknownHostException: <crawl_site>: <crawl_site>
> >
> >

