Thanks Julien.
I changed the plugin.includes property in nutch-site.xml to use parse-(tika)
instead of parse-(text | html | js | tika).
It works now, since Nutch no longer picks up any parser other than Tika.
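
Roughly, the change in conf/nutch-site.xml would look like the sketch below. Only the parse-(...) part comes from this thread. The other plugin names in the value are placeholders for illustration, so keep whatever your own plugin.includes already lists and change just the parser group.

```xml
<!-- Sketch only: entries other than parse-(tika) are assumed placeholders. -->
<property>
  <name>plugin.includes</name>
  <!-- Previously the parser group was parse-(text|html|js|tika). -->
  <value>protocol-http|urlfilter-regex|parse-(tika)|index-(basic|anchor)</value>
</property>
```

With parse-html out of the list, HTML pages should go through parse-tika, so the HtmlParser.parseNeko code path from the stack trace below is not reached.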

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Harry,
>
> Could you try using parse-tika instead and see if you are getting the same
> problem? I gather from your email that you are using Nutch 1.1 or the SVN
> version, so parse-tika should be used by default. Have you deactivated it?
>
> Thanks
>
> Julien
>
> On 21 April 2010 11:58, Harry Nutch <harrynu...@gmail.com> wrote:
>
> > Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
> > fix the problem.
> >
> > On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch <harrynu...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am running the latest version of Nutch. While crawling one particular
> > > site, I get an AbstractMethodError in the cyberneko plugin for all of its
> > > pages during the fetch step.
> > > As I understand it, this error usually comes from a mismatch between the
> > > compile-time and runtime versions of a class. However, I am running a
> > > fresh build after an ant clean.
> > >
> > > Any suggestions would be helpful. By the way, I am using Java version
> > > "1.6.0_18" on Windows.
> > >
> > >
> > > java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
> > >         at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
> > >         at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
> > >         at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
> > >         at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
> > >         at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
> > >         at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
> > >         at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
> > >         at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
> > >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > >         at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > >         at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > >         at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > >         at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
> > >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> > >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
> > >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> > >
> > >
> > >
> >
>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
