[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Talat UYARER updated NUTCH-1253: Attachment: NUTCH-1253-2.x-eclipse.patch [~icebergx5] is right. At the present 2.x branch does not work with eclipse. Eclipse says Missing required library about neko 0.9.5. I think [~lewismc] forget adding nekohtml dependecy for eclipse target in build.xml. I create a bugfix patch for 2.x. Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-eclipse.patch, NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1253: --- Attachment: NUTCH-1253-trunk.v2.patch Hi [~lewismc], the HTML which fails to parse looks not really incorrect: both {{a name=.../}} and {{iframe src=.../}} are empty XML-style tags (bachelor tags). According to Neko's [Change History|http://nekohtml.sourceforge.net/changes.html] a configuration feature allow-selfclosing-iframe was introduced in v1.19.15. If the feature is set to true, the problematic document is parsed successfully. Attached patch adds allow-selfclosing-iframe for both parse-html (parse plugin and test) and parse-tika (test only). Tests now pass. Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Attachment: NUTCH-1253-trunk.patch Patch for trunk. The lapse in our testing code was a combination of the closing /a and the iframe tag which needed to be closed as well. I've made the changes to TestDOMContentutils in parse-html and parse-tika, updated all plugin and ivy configuration and also formatted TestDOMContentUtils as per the Nutch code formatting. This was a paint o track down but eventually we got there. Thanks if anyone can review. Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Attachment: TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt Actually, I can confirm that this upgrade seems to break tests in TestDOMContentUtils (see attached). It seems that the document fragment contains some markup... which is incorrect. Having sat in Eclipse for ages debugging this, I am now over on the nekohtml user list trying to sort this out. If you guys have any ideas then please chip in. I am not sure whether it's the way we use Xerces2, Neko or maybe a bug in DomContentUtils but there is undesired behavior anyway. Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1253: --- Attachment: nutch1253test.html nutch1253parsed.html It's likely a regression in NekoHTML: {{a name=bottom/}} encloses erroneously the rest of the document inclusively {{/body/html}} which is interpreted as textual content. See attached document (taken from failed test unit) and output by parse-html using Neko 1.9.15/19. Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1253: --- Fix Version/s: 1.8 Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Assignee: Lewis John McGibbney Fix For: 2.3, 1.8 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Fix Version/s: 2.2 1.7 Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Fix For: 1.7, 2.2 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Attachment: NUTCH-1253-nutchgora.patch NUTCH-1253.patch Trivial patches for both trunk and Nutchgora branch. Can you guys please test and get back on this issue. Thanks Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Patch Info: Patch Available Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib- nekohtml) - xercesImpl-2.9.1.jar (under .../runtime/local/lib) These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.) I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following: plugin id=lib-nekohtml name=CyberNeko HTML Parser version=1.9.11 provider-name=org.cyberneko runtime library name=nekohtml-0.9.5.jar export name=*/ /library /runtime /plugin Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira