Hi, I'm testing my custom parser plugin for Nutch 1.2, which matches some regular expressions against the content and stores the matched text in my database. When I test it in Eclipse, everything works well. But when I use it in my production environment, some warnings are logged in hadoop.log, like the following:
> 2011-06-11 00:33:06,760 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
>
> 2011-06-11 00:33:07,302 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>
> 2011-06-11 00:33:08,303 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>
> 2011-06-11 00:33:09,303 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>
> 2011-06-11 00:33:09,940 WARN parse.ParseUtil - Unable to successfully parse content http://www.eccom.com.cn/EN/ of type application/xhtml+xml
>
> 2011-06-11 00:33:09,943 WARN fetcher.Fetcher - Error parsing: http://www.eccom.com.cn/EN/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

When I remove the plugin from nutch-site.xml, crawling works correctly. Any ideas? Thanks.