[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1253:
-----------------------------------
    Attachment: nutch1253test.html
                nutch1253parsed.html

It's likely a regression in NekoHTML: {{<a name="bottom"/>}} erroneously
encloses the rest of the document, including {{</body></html>}}, which is then
interpreted as textual content. See the attached document (taken from the
failed unit test) and the output of parse-html using Neko 1.9.15/19.

> Incompatible neko and xerces versions
> -------------------------------------
>
>                 Key: NUTCH-1253
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1253
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>         Environment: Ubuntu 10.04
>            Reporter: Dennis Spathis
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch,
> NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt,
> TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt,
> nutch1253parsed.html, nutch1253test.html
>
>
> The Nutch 1.4 distribution includes
>      - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
>      - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser
> (configured to use neko) is invoked during a local-mode crawl, the parse
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError,
> rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
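
To illustrate the note about catch(Throwable): AbstractMethodError is a subclass of java.lang.Error, so a catch (Exception e) clause never sees it, while catch (Throwable t) does. The following is only a standalone sketch and does not use any Nutch classes; ParseDiagnostics, runLogged, and the simulated failure are hypothetical names chosen for illustration, not the actual getParse code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.function.Supplier;

public class ParseDiagnostics {

    /**
     * Runs a parse step (hypothetical stand-in for the body of getParse).
     * Returns its result, or null after appending the stack trace to the log.
     */
    public static <T> T runLogged(Supplier<T> parseStep, StringBuilder log) {
        try {
            return parseStep.get();
        } catch (Throwable t) {
            // Throwable rather than Exception: AbstractMethodError is an Error.
            StringWriter trace = new StringWriter();
            t.printStackTrace(new PrintWriter(trace, true));
            log.append(trace.toString());
            return null;
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        // Simulate the linkage failure caused by incompatible JARs.
        String result = runLogged(() -> {
            throw new AbstractMethodError("simulated");
        }, log);
        System.out.println("result = " + result);
        System.out.print(log);
    }
}
```

Catching Throwable this broadly is reasonable as a temporary diagnostic, but it also swallows errors such as OutOfMemoryError, so it is probably best removed once the cause is known.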
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> <plugin
>    id="lib-nekohtml"
>    name="CyberNeko HTML Parser"
>    version="1.9.11"
>    provider-name="org.cyberneko">
>    <runtime>
>       <library name="nekohtml-0.9.5.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
> </plugin>
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
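
The version mismatch described in the quoted plugin.xml could be detected mechanically. The sketch below uses only the JDK's XML parser; PluginVersionCheck and findMismatches are hypothetical names, and the matching rule (the jar file name must embed the declared version) is an assumption made for illustration, not a convention that Nutch itself enforces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PluginVersionCheck {

    /** Returns the library names that do not embed the plugin's declared version. */
    public static List<String> findMismatches(String pluginXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pluginXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
        Element plugin = doc.getDocumentElement();
        String declared = plugin.getAttribute("version");
        List<String> mismatches = new ArrayList<>();
        NodeList libraries = plugin.getElementsByTagName("library");
        for (int i = 0; i < libraries.getLength(); i++) {
            String jar = ((Element) libraries.item(i)).getAttribute("name");
            // Assumed rule: the jar name should end in "-<declared version>.jar".
            if (!jar.endsWith("-" + declared + ".jar")) {
                mismatches.add(jar);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // The descriptor quoted in the issue, as a single string.
        String xml = "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
                + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
                + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
                + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println(findMismatches(xml));
    }
}
```

For the descriptor quoted in the issue, the returned list contains nekohtml-0.9.5.jar; with the library renamed to nekohtml-1.9.11.jar the list is empty.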