If Nutch runs on a different machine than your browser, DNS on that machine
may not be resolving the host after all. To solve the issue you will have to
make the host resolvable there. Take a look at the Nutch logs.
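
A quick way to check this is to ask the resolver on the machine that runs Nutch directly. The following is only a minimal sketch in Python, not part of Nutch; the function name resolve_host is made up for illustration. It assumes the JVM uses the same system resolver, which is normally the case but worth confirming in your setup.

```python
import socket
import sys


def resolve_host(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver
    reports for hostname, or an empty list if the lookup fails.

    resolve_host is a hypothetical helper for this example only.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Roughly the situation Java reports as UnknownHostException.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass the host from the failing robots.txt URL as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    addresses = resolve_host(host)
    if addresses:
        print(f"{host} resolves to: {', '.join(addresses)}")
    else:
        print(f"{host} does not resolve on this machine")
```

Run it on the Nutch machine with the host name as the only argument. If it reports that the host does not resolve there while your browser machine can reach it, the two machines are using different DNS settings or hosts files.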
 
 
-----Original message-----
> From:Chethan Prasad <chethan.p...@gmail.com>
> Sent: Thu 07-Jun-2012 16:49
> To: Markus Jelsma <markus.jel...@openindex.io>; user@nutch.apache.org
> Subject: RE: robots.txt UnknownHostException
> 
> Well I can reach it from the browser. So the DNS should be good there.
> 
> Thanks,
> Chethan
> From: Markus Jelsma
> Sent: 6/7/2012 8:07 PM
> To: user@nutch.apache.org
> Subject: RE: robots.txt UnknownHostException
> Hi
> 
> It cannot resolve the host and therefore crawls none of the pages on
> that host. Make sure your DNS settings are correct and the host actually
> exists, or add it manually to your hosts file.
> 
> Cheers
> 
> 
> -----Original message-----
> > From:chethan <chethan.p...@gmail.com>
> > Sent: Thu 07-Jun-2012 16:29
> > To: user@nutch.apache.org
> > Subject: Re: robots.txt UnknownHostException
> >
> > But that should not stop it from crawling the rest of the site, right? What
> > I'm seeing here is that when the UnknownHostException is thrown for the
> > robots.txt URL, the rest of the site is never crawled. Shouldn't it find
> > more links on the root page and follow them?
> >
> > Thanks,
> > Chethan
> >
> > On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> > > Hi,
> > >
> > > Nutch will fetch URLs without robots.txt, but if the robots.txt request
> > > throws an UnknownHostException, the URL itself will throw it as well and fail.
> > >
> > > Cheers
> > >
> > >
> > > -----Original message-----
> > > > From:chethan <chethan.p...@gmail.com>
> > > > Sent: Thu 07-Jun-2012 16:16
> > > > To: user@nutch.apache.org
> > > > Subject: robots.txt UnknownHostException
> > > >
> > > > Hi,
> > > >
> > > > > When Nutch doesn't find the robots.txt for a given URL, why does it not
> > > > > fetch that URL at all? I mean, if robots.txt is not found, doesn't that
> > > > > mean that the owner of that website doesn't really care about crawlers?
> > > > > So, it's OK for Nutch to fetch from it, right?
> > > >
> > > > Thanks,
> > > > Chethan
> > > >
> > >
> >
> 
