Hello Markus,
> > I'm using the official nutch 1.3 distribution to crawl our internal
> > mediawiki instance. Whenever a 404 is encountered, I get a
> >
> > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > with: java.net.SocketTimeoutException: Read timed out
> I cannot confirm this when parsing a local 404 page. What do you get
> when fetching that page with:
>
> bin/nutch org.apache.nutch.parse.ParserChecker
> http://wiki.example.org/INTERN_WIKI:Impressum
>
> you should get a nice 404

I get an error:

$ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

real 0m13.007s
user 0m1.530s
sys 0m0.150s

Curl does it nicely:

$ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
HTTP/1.1 404 Not Found
Date: Mon, 01 Aug 2011 11:14:57 GMT
Server: Apache/2.2.16 (Debian)
X-Powered-By: PHP/5.3.3-7+squeeze3
Content-language: de
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8

real 0m0.434s
user 0m0.010s
sys 0m0.000s

-- 
Best regards
Christian Weiske

