Hello Markus,

> > > I cannot confirm this when parsing a local 404 page. What do you
> > > get when fetching that page with:
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > I get an error:
> >
> > $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> > Exception in thread "main" java.lang.NullPointerException
> >     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> Strange! Can you confirm the parse checker with other 404 pages on
> the internet?
>
> bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404

This works for me:
------------------
$ bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404
---------
Url
---------------
http://nutch.apache.org/404---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 404 Not Found
Outlinks: 0
Content Metadata: Date=Mon, 01 Aug 2011 11:29:46 GMT Content-Length=309 Content-Type=text/html; charset=iso-8859-1 Connection=close Server=Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
------------------

> Perhaps your wiki returns some funny data that protocol plugin
> doesn't understand. What do you use? Protocol-http or
> protocol-httpclient?

I use the standard settings, except for three custom ones in
conf/nutch-site.xml:
> http.agent.name, fetcher.server.delay and fetcher.threads.per.host

If I understand it correctly, conf/nutch-default.xml contains
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)
> |index-(basic|anchor)|scoring-opic
> |urlnormalizer-(pass|regex|basic)</value>
so it's "protocol-http".

--
Best regards
Christian Weiske

