Hi Lewis,

The good news is that with this override I no longer get the Neko parsing error, but I saw this output in hadoop.log:

2013-01-09 06:29:43,738 INFO parse.ParserJob - Parsing http://www.ab-advisory.com/
2013-01-09 06:29:43,745 WARN parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.tika.TikaParser mapped to contentType application/xhtml+xml via parse$
2013-01-09 06:29:43,745 WARN parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.tika.TikaParser mapped to contentType * via parse-plugins.xml, but no$
2013-01-09 06:29:43,745 WARN parse.ParseUtil - *No suitable parser found: parser not found for contentType=application/xhtml+xml url= http://www.ab-advisory.com/*
2013-01-09 06:29:46,466 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-09 06:29:47,435 INFO parse.ParserJob - ParserJob: success

It seems that Tika cannot parse the HTML file. Am I wrong?

kr,
Arcondo

On Wed, Jan 9, 2013 at 12:22 AM, Lewis John Mcgibbney < [email protected]> wrote:

> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
>
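
To double-check the mapping quoted above, here is a minimal sketch in Python. The warnings are cut off in my terminal, so this is only a guess: I assume they concern whether parse-tika is actually enabled through the plugin.includes property, and that this property is a regular expression that must match the whole plugin id. The PLUGIN_INCLUDES value below is hypothetical and not copied from my configuration, and the helper names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The mimeType entries quoted in this thread, wrapped in a root
# element so the fragment parses on its own.
PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""

# Hypothetical plugin.includes value, for illustration only.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-html|index-basic"


def mapped_plugins(xml_text):
    """Return {mime type: [plugin ids]} from a parse-plugins fragment."""
    root = ET.fromstring(xml_text)
    return {
        mime.get("name"): [p.get("id") for p in mime.findall("plugin")]
        for mime in root.findall("mimeType")
    }


def not_enabled(mapping, includes):
    """Return, per mime type, the mapped plugin ids the regex does not fully match."""
    pattern = re.compile(includes)
    return {
        mime: [pid for pid in ids if not pattern.fullmatch(pid)]
        for mime, ids in mapping.items()
    }


if __name__ == "__main__":
    for mime, missing in not_enabled(mapped_plugins(PARSE_PLUGINS), PLUGIN_INCLUDES).items():
        for pid in missing:
            print(f"{pid} is mapped to {mime} but not matched by plugin.includes")
```

If the same check is run with a plugin.includes value that contains parse-(html|tika), nothing is reported, which would be consistent with the mapping being fine and only the enablement missing.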

