Hello Markus,

> > I'm using the official nutch 1.3 distribution to crawl our internal
> > mediawiki instance. Whenever a 404 is encountered, I get a
> > 
> > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > with: java.net.SocketTimeoutException: Read timed out

> I cannot confirm this when parsing a local 404 page. What do you get
> when fetching that page with:
> 
> bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> 
> you should get a nice 404


Instead of a 404, I get a NullPointerException after about 13 seconds:

$ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

real    0m13.007s
user    0m1.530s
sys     0m0.150s


curl handles the same URL without problems and returns the 404 in under half a second:

$ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
HTTP/1.1 404 Not Found
Date: Mon, 01 Aug 2011 11:14:57 GMT
Server: Apache/2.2.16 (Debian)
X-Powered-By: PHP/5.3.3-7+squeeze3
Content-language: de
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8


real    0m0.434s
user    0m0.010s
sys     0m0.000s


-- 
Best regards
Christian Weiske
