[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Attachment: NUTCH-840-2.x.patch

Patch for 2.x. There currently appears to be a discrepancy in the detection of outlinks. We are detecting more than the test expects (5 instead of 3):

{code}
Testsuite: org.apache.nutch.parse.tika.TestDOMContentUtils
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.496 sec

Testcase: testGetTitle took 0.331 sec
Testcase: testGetText took 0.069 sec
Testcase: testGetOutlinks took 0.08 sec
    FAILED
got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


junit.framework.AssertionFailedError: got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


    at org.apache.nutch.parse.tika.TestDOMContentUtils.compareOutlinks(TestDOMContentUtils.java:315)
    at org.apache.nutch.parse.tika.TestDOMContentUtils.testGetOutlinks(TestDOMContentUtils.java:296)
{code}

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Fix For: 2.4
    Attachments: NUTCH-840-2.x.patch, NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
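In the failing output above, the two extra outlinks repeat URLs that are already present (http://www.nutch.org/ and http://www.nutch.org/docs/1) but carry an empty anchor. The following is a minimal, self-contained sketch of that observation only: collapsing links by URL and preferring the non-empty anchor reduces the five reported links to the three the test expects. The `Link` class and `dedupe` method are hypothetical stand-ins written for illustration; they are not Nutch's `Outlink` API and not the fix proposed in this issue. Whether the duplicates should be filtered by the parser or accepted by the test is left open in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutlinkDedupSketch {

    /** Hypothetical stand-in for an outlink: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    /** Keeps one link per URL in first-seen order, preferring a non-empty anchor. */
    static List<Link> dedupe(List<Link> links) {
        Map<String, Link> byUrl = new LinkedHashMap<>();
        for (Link link : links) {
            Link seen = byUrl.get(link.toUrl);
            // Store the link if the URL is new, or if it improves on an empty anchor.
            if (seen == null || (seen.anchor.isEmpty() && !link.anchor.isEmpty())) {
                byUrl.put(link.toUrl, link);
            }
        }
        return new ArrayList<>(byUrl.values());
    }

    public static void main(String[] args) {
        // The five links from the "got:" section of the test output.
        List<Link> got = Arrays.asList(
            new Link("http://www.nutch.org/", "home"),
            new Link("http://www.nutch.org/", ""),
            new Link("http://www.nutch.org/docs/1", "1"),
            new Link("http://www.nutch.org/docs/1", ""),
            new Link("http://www.nutch.org/docs/2", "2"));

        // Prints the three links listed under "answer:" in the test output.
        for (Link link : dedupe(got)) {
            System.out.println(link);
        }
    }
}
```

Running `main` prints the three links listed under "answer:", which suggests the discrepancy comes from the same link being emitted twice rather than from unrelated extra URLs.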
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
    Assignee: (was: Julien Nioche)

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Fix For: 2.4
    Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Fix Version/s: 2.4 (was: 2.3)

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.4
    Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-840:
    Fix Version/s: 2.3 (was: 1.10)

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.3
    Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-840:
    Fix Version/s: 1.8

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.3, 1.8
    Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Patch Info: Patch Available

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 1.7, 2.2
    Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Attachment: NUTCH-840v2.patch

This is for trunk. There is a problem here where the new tests (for parse-tika) also seem to be executed against (within?) other plugin testing scenarios... I am stuck at the moment as to why this is. Once we fix this, we will port to 2.x.

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.2
    Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
    Attachment: NUTCH-840-trunk.patch

Modified version of the patch that fixes the tests after NUTCH-797.

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.2
    Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
    Affects Version/s: 1.6
    Fix Version/s: 1.7

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1, 1.6
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 1.7, 2.2
    Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Fix Version/s: 2.2 (was: 2.1)

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.2
    Attachments: NUTCH-840.patch, NUTCH-840.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Fix Version/s: 2.1 (was: nutchgora)

Set and Classify

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.1
    Attachments: NUTCH-840.patch, NUTCH-840.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
    Attachment: NUTCH-840.patch

Hi Julien. I have absolutely no idea how or when I ended up working on this, but I think the attachment nearly addresses this issue. It is from a while back and, to be honest, I can't really remember working on it... Anyway, I think the parse-tika tests fail as it is not quite working properly yet. The patch also changes the directory structure to o.a.n.p.tika rather than the existing o.a.n.tika, which is inconsistent with the other parser plugin implementations we ship with Nutch. Sorry for hijacking this one slightly.

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: nutchgora
    Attachments: NUTCH-840.patch, NUTCH-840.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
    Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika parser.

The tests currently rely on some DOM-related code from Neko-HTML, which introduces a dependency on the plugin lib-nekohtml. Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which will be removed shortly. Once this is done we can delete lib-nekohtml as well, then either:
a) add the neko jar to the parse-tika lib via Ivy
b) replace it with another implementation already available from the Tika dependencies or the main Nutch dependencies (e.g. dom4j)

Port tests from parse-html to parse-tika
    Key: NUTCH-840
    URL: https://issues.apache.org/jira/browse/NUTCH-840
    Project: Nutch
    Issue Type: Task
    Components: parser
    Affects Versions: 1.1
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 2.0
    Attachments: NUTCH-840.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
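To illustrate what option (b) could look like, here is a minimal sketch that builds a DOM for a test page without Neko-HTML. It swaps in the JDK's built-in `javax.xml.parsers` parser, which is not the dom4j alternative named in the comment and only accepts well-formed XHTML, so it is an assumption about what test fixtures would need, not the approach taken in the attached patch. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class JdkDomAnchorSketch {

    /** Parses well-formed XHTML and returns "href -> anchor text" for each <a> element. */
    static List<String> anchors(String xhtml) {
        try {
            // Assumption: the input is well-formed XML; real-world HTML usually is not.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            NodeList links = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element a = (Element) links.item(i);
                result.add(a.getAttribute("href") + " -> " + a.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse test page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<a href=\"http://www.nutch.org/\">home</a> "
            + "<a href=\"docs/1\">1</a>"
            + "</body></html>";
        for (String anchor : anchors(page)) {
            System.out.println(anchor);
        }
    }
}
```

The trade-off is that a strict XML parser removes the extra dependency but cannot handle the malformed markup that a tag-soup parser such as Neko-HTML tolerates, so it would only suit tests whose fixtures are valid XHTML.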