Strange! Can you reproduce this with the parser checker on other 404 pages on
the internet?

bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404

Perhaps your wiki returns some odd data that the protocol plugin doesn't
understand. Which one do you use: protocol-http or protocol-httpclient?
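
One difference worth ruling out: curl -I only sends a HEAD request, whereas
the fetcher, as far as I know, issues a GET and reads the response body. The
sketch below does a plain GET with a read timeout and drains the body, so a
stalled 404 body would show up as a SocketTimeoutException. It uses only the
JDK's HttpURLConnection, not Nutch's protocol plugin code; the class name
StatusCheck, the method fetchStatus and the 10 second timeout are made up for
illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

public class StatusCheck {

    /** Sends a GET (not a HEAD), drains the body and returns the HTTP status code. */
    static int fetchStatus(String address, int timeoutMs) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(timeoutMs);
        // A server that stalls while sending the response raises SocketTimeoutException here.
        conn.setReadTimeout(timeoutMs);
        conn.setInstanceFollowRedirects(false);
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx responses the body is exposed via the error stream.
            InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (body != null) {
                try (InputStream in = body) {
                    byte[] buf = new byte[4096];
                    while (in.read(buf) != -1) {
                        // discard; we only care whether the read completes
                    }
                }
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass your own URL as the first argument.
        String address = args.length > 0 ? args[0] : "http://nutch.apache.org/404";
        System.out.println(address + " -> " + fetchStatus(address, 10_000));
    }
}
```

If this returns 404 quickly for your wiki URL while the fetcher still times
out, the problem is more likely in how the protocol plugin reads the response
than in the server itself.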

On Monday 01 August 2011 13:17:06 Christian Weiske wrote:
> Hello Markus,
> 
> > > I'm using the official nutch 1.3 distribution to crawl our internal
> > > mediawiki instance. Whenever a 404 is encountered, I get a
> > > 
> > > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > > with: java.net.SocketTimeoutException: Read timed out
> > 
> > I cannot confirm this when parsing a local 404 page. What do you get
> > when fetching that page with:
> > 
> > bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> > 
> > you should get a nice 404
> 
> I get an error:
> 
> $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> 
> real  0m13.007s
> user  0m1.530s
> sys   0m0.150s
> 
> 
> Curl does it nicely:
> 
> $ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found
> Date: Mon, 01 Aug 2011 11:14:57 GMT
> Server: Apache/2.2.16 (Debian)
> X-Powered-By: PHP/5.3.3-7+squeeze3
> Content-language: de
> Vary: Accept-Encoding,Cookie
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: private, must-revalidate, max-age=0
> Content-Type: text/html; charset=UTF-8
> 
> 
> real  0m0.434s
> user  0m0.010s
> sys   0m0.000s

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
