Strange! Can you confirm the parse checker with other 404 pages on the internet?

bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404

Perhaps your wiki returns some funny data that protocol plugin doesn't understand. What do you use? Protocol-http or protocol-httpclient?

On Monday 01 August 2011 13:17:06 Christian Weiske wrote:
> Hello Markus,
>
> > > I'm using the official nutch 1.3 distribution to crawl our internal
> > > mediawiki instance. Whenever a 404 is encountered, I get a
> > >
> > > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > > with: java.net.SocketTimeoutException: Read timed out
> >
> > I cannot confirm this when parsing a local 404 page. What do you get
> > when fetching that page with:
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker
> > http://wiki.example.org/INTERN_WIKI:Impressum
> >
> > you should get a nice 404
>
> I get an error:
>
> $ time bin/nutch org.apache.nutch.parse.ParserChecker
> http://wiki.example.org/INTERN_WIKI:Impressum Exception in thread "main"
> java.lang.NullPointerException at
> org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>
> real 0m13.007s
> user 0m1.530s
> sys 0m0.150s
>
>
> Curl does it nicely:
>
> $ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found
> Date: Mon, 01 Aug 2011 11:14:57 GMT
> Server: Apache/2.2.16 (Debian)
> X-Powered-By: PHP/5.3.3-7+squeeze3
> Content-language: de
> Vary: Accept-Encoding,Cookie
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: private, must-revalidate, max-age=0
> Content-Type: text/html; charset=UTF-8
>
>
> real 0m0.434s
> user 0m0.010s
> sys 0m0.000s

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

