Thanks Julien. I have changed nutch-site.xml so that the plugin.includes property contains only parse-(tika) instead of parse-(text|html|js|tika). It works now, since no parser other than tika gets picked up.
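For illustration, here is a minimal sketch of why narrowing the value has this effect. It assumes Nutch treats plugin.includes as a regular expression that a plugin id must match in full; that is my understanding of the plugin loader, not something confirmed in this thread. The class and method names are hypothetical, and only the parse-* portion of the property is shown.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {

    // Hypothetical helper: true when the plugin id fully matches the
    // include pattern (assumed to mirror how plugin.includes is applied).
    public static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String before = "parse-(text|html|js|tika)"; // previous value (parse part only)
        String after = "parse-(tika)";               // value after the change
        String[] ids = {"parse-text", "parse-html", "parse-js", "parse-tika"};

        // Show, for each parser plugin id, whether it is selected by each value.
        for (String id : ids) {
            System.out.println(id
                    + " before=" + isIncluded(before, id)
                    + " after=" + isIncluded(after, id));
        }
    }
}
```

Under that assumption, parse-html (the plugin whose HtmlParser.parseNeko frame appears in the trace) no longer matches the pattern, so only parse-tika is left to handle the pages.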

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Harry,
>
> Could you try using parse-tika instead and see if you are getting the same
> problem? I gather from your email that you are using Nutch 1.1 or the SVN
> version, so parse-tika should be used by default. Have you deactivated it?
>
> Thanks
>
> Julien
>
> On 21 April 2010 11:58, Harry Nutch <harrynu...@gmail.com> wrote:
>
> > Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
> > fix the problem.
> >
> > On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch <harrynu...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am running the latest version of nutch. While crawling one particular
> > > site I get an AbstractMethodError in the cyberneko plugin for all of its
> > > pages when doing a Fetch.
> > > As I understand it, this is caused by a difference between the runtime
> > > and compile versions. However, I am running it afresh after an ant clean.
> > >
> > > Any suggestions would be helpful. Btw, I am using java version "1.6.0_18"
> > > on a windows environment.
> > >
> > > java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
> > >     at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
> > >     at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
> > >     at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
> > >     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
> > >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > >     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > >     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > >     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > >     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
> > >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> > > java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
> > >     at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
> > >     at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
> > >     at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
> > >     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
> > >     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
> > >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > >     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > >     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > >     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > >     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
> > >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> > >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com