[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---
Attachment: NUTCH-840-2.x.patch

Patch for 2.X.
There currently appears to be a discrepancy in the detection of outlinks. We 
are detecting more than the test expects: 5 instead of 3, where the two extra 
outlinks repeat an already detected URL with an empty anchor.

{code}
Testsuite: org.apache.nutch.parse.tika.TestDOMContentUtils
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.496 sec

Testcase: testGetTitle took 0.331 sec
Testcase: testGetText took 0.069 sec
Testcase: testGetOutlinks took 0.08 sec
FAILED
got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


junit.framework.AssertionFailedError: got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


at org.apache.nutch.parse.tika.TestDOMContentUtils.compareOutlinks(TestDOMContentUtils.java:315)
at org.apache.nutch.parse.tika.TestDOMContentUtils.testGetOutlinks(TestDOMContentUtils.java:296)
{code}
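
The failure output suggests that each extra outlink duplicates a URL that was already found, only with an empty anchor. As a rough illustration of that reading (not the actual Nutch or parse-tika code), the sketch below collapses such duplicates in a plain list. The `Link` class and the `dedupe` method are hypothetical stand-ins invented for this example. Whether the right fix belongs in the parser or in the test expectations is not settled by this message.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlinkDedupSketch {

  /** Hypothetical stand-in for an outlink: target URL plus anchor text. */
  static final class Link {
    final String toUrl;
    final String anchor;

    Link(String toUrl, String anchor) {
      this.toUrl = toUrl;
      this.anchor = anchor;
    }

    @Override
    public String toString() {
      return "toUrl: " + toUrl + " anchor: " + anchor;
    }
  }

  /**
   * Keeps one link per target URL, in first-seen order. An entry with an
   * empty anchor is replaced if a later entry for the same URL has text.
   */
  static List<Link> dedupe(List<Link> links) {
    Map<String, Link> byUrl = new LinkedHashMap<>();
    for (Link link : links) {
      Link seen = byUrl.get(link.toUrl);
      if (seen == null || (seen.anchor.isEmpty() && !link.anchor.isEmpty())) {
        byUrl.put(link.toUrl, link);
      }
    }
    return new ArrayList<>(byUrl.values());
  }

  public static void main(String[] args) {
    // The five outlinks reported under "got:" in the test output.
    List<Link> got = new ArrayList<>();
    got.add(new Link("http://www.nutch.org/", "home"));
    got.add(new Link("http://www.nutch.org/", ""));
    got.add(new Link("http://www.nutch.org/docs/1", "1"));
    got.add(new Link("http://www.nutch.org/docs/1", ""));
    got.add(new Link("http://www.nutch.org/docs/2", "2"));

    for (Link link : dedupe(got)) {
      System.out.println(link);
    }
  }
}
```

Run on the five reported outlinks, this leaves the three entries listed under "answer:" in the test output.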

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
 Fix For: 2.4

 Attachments: NUTCH-840-2.x.patch, NUTCH-840-trunk.patch, 
 NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:

Assignee: (was: Julien Nioche)

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
 Fix For: 2.4

 Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin





[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2014-11-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---
Fix Version/s: (was: 2.3)
   2.4

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.4

 Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin





[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2014-08-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-840:
--

Fix Version/s: (was: 1.10)
   2.3

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.3

 Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-840:
--

Fix Version/s: 1.8

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.3, 1.8

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2013-01-07 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Patch Info: Patch Available

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7, 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Attachment: NUTCH-840v2.patch

This is for trunk.
There is a problem here: the new tests (for parse-tika) also seem to be 
executed against (within?) other plugin testing scenarios... I am stuck atm as 
to why this is.
Once we fix this, we will port the patch to 2.x.

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840-trunk.patch

Modified version of the patch to fix the tests post NUTCH-797

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Affects Version/s: 1.6
Fix Version/s: 1.7

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.7, 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Fix Version/s: (was: 2.1)
   2.2

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.2

 Attachments: NUTCH-840.patch, NUTCH-840.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-04-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Fix Version/s: (was: nutchgora)
   2.1

Set and Classify

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.1

 Attachments: NUTCH-840.patch, NUTCH-840.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-01-09 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Attachment: NUTCH-840.patch

Hi Julien. I have absolutely no idea how or when I ended up working on this, 
but I think the attachment nearly addresses this issue. It is from a while back 
and to be honest I can't really remember working on it...

Anyway, I think the parse-tika tests fail as it is not quite working properly 
yet. The patch also changes the directory structure to o.a.n.p.tika rather than 
the existing o.a.n.tika, which is inconsistent with the other parser plugin 
implementations we ship with Nutch.

Sorry for hijacking this one slightly.

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: nutchgora

 Attachments: NUTCH-840.patch, NUTCH-840.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin





[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika

2010-07-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika Parser

The tests currently rely on some DOM-related code from Neko-HTML, which 
introduces a dependency on the plugin lib-nekohtml.
Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which 
will be removed shortly. Once this is done we can delete lib-nekohtml as well 
and then either:
a) add the neko jar to the parse-tika lib via IVY
b) replace it with another implementation already available from the tika 
dependencies or the main Nutch dependencies (e.g. dom4j)
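
To give a feel for what option (b) could look like, here is a minimal sketch that builds a DOM for a test fixture without Neko-HTML. It swaps in the JDK's built-in `javax.xml.parsers` API rather than dom4j, purely to keep the example free of extra dependencies. The class name, the `parse` and `title` helpers, and the fixture string are assumptions made for illustration and do not come from the patch. Note that the JDK parser only accepts well-formed markup, so fixtures that rely on Neko's tolerance of tag soup would need a different approach.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JdkDomFixtureSketch {

  /** Parses a well-formed XHTML string into a W3C DOM document. */
  static Document parse(String xhtml) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(
        new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  /** Returns the trimmed text of the first title element, or "" if absent. */
  static String title(Document doc) {
    NodeList titles = doc.getElementsByTagName("title");
    return titles.getLength() == 0
        ? ""
        : titles.item(0).getTextContent().trim();
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical fixture; real test pages may not be well-formed.
    String page =
        "<html><head><title> my title </title></head>"
            + "<body>body text</body></html>";
    System.out.println(title(parse(page)));
  }
}
```

The resulting `org.w3c.dom.Document` is the same DOM interface family that Neko-HTML produces, so DOM-walking utilities written against `org.w3c.dom.Node` should be able to consume it, assuming the fixtures can be kept well-formed.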





 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-840.patch


 We don't have tests for HTML in parse-tika so I'll copy them from the old 
 parse-html plugin

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.