I cannot confirm this when parsing a local 404 page. What do you get when 
fetching that page with:

bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum

You should get a clean 404.
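
One possible explanation for the difference, offered only as a guess: curl -I sends a HEAD request, whereas a fetcher normally issues a GET, and a server can answer the headers quickly yet stall while sending the body. As a rough check outside Nutch, the sketch below issues a full GET with a read timeout and reports either the HTTP status or a timeout. The probe function and its 10-second default are assumptions made for illustration; they are not part of Nutch.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Issue a full GET (headers and body) against url.

    Returns the HTTP status code as an int, or the string "timeout"
    if connecting or reading stalls for longer than `timeout` seconds.
    Hypothetical helper for illustration; not part of Nutch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # read the body, as a fetcher would
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive here; the error body is still readable
        try:
            err.read()
        except (socket.timeout, TimeoutError):
            return "timeout"  # headers arrived, body stalled
        return err.code
    except (socket.timeout, TimeoutError):
        return "timeout"  # stalled while waiting for the response
    except urllib.error.URLError as err:
        # connect-phase timeouts are wrapped in URLError
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        raise


if __name__ == "__main__":
    print(probe("http://wiki.example.org/INTERN_WIKI:Impressum"))
```

If this prints 404, the server answers a GET properly and the problem is more likely on the fetcher side; if it prints timeout, the server is not completing the response to a GET even though the HEAD succeeds.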


On Monday 01 August 2011 08:41:07 Christian Weiske wrote:
> Hello,
> 
> 
> I'm using the official nutch 1.3 distribution to crawl our internal
> mediawiki instance. Whenever a 404 is encountered, I get a
> 
> > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > with: java.net.SocketTimeoutException: Read timed out
> 
> The page really does not exist:
> > $ curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> > HTTP/1.1 404 Not Found
> 
> So I think the error message is misleading. Is that a bug?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
