Re: bug in parse-tika or Tika RTFParser?

Ken Krugler Wed, 15 Aug 2012 17:29:28 -0700

Hi Lewis,

[Moving to the dev list]


For many Tika parsers, the text you get back from the document starts with the 
title (if any), and then contains the body.

So I'm wondering if what you're seeing in the test failure is that the 
parse.getText() result is actually "test rtf document\nThe quick brown fox…"
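
If that is the case, the body assertion could be made to tolerate a leading title. A minimal sketch in plain Java, with no Nutch or JUnit dependencies: bodyWithoutTitle is a hypothetical helper (not part of parse-tika or Tika), and the sample strings are assumed rather than taken from the real test document.

```java
public class TitlePrefixCheck {

    // Hypothetical helper: if the extracted text starts with the title,
    // drop that prefix and return the remaining body, trimmed.
    // Otherwise return the trimmed text unchanged.
    static String bodyWithoutTitle(String title, String text) {
        String trimmed = text.trim();
        if (title != null && !title.isEmpty() && trimmed.startsWith(title)) {
            return trimmed.substring(title.length()).trim();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Assumed values, standing in for parse.getTitle() and parse.getText().
        String title = "test rft document";
        String text = "test rft document\nThe quick brown fox jumps over the lazy dog";
        System.out.println(bodyWithoutTitle(title, text));
    }
}
```

In the test, the second assertion would then compare the expected sentence against bodyWithoutTitle(title, text) instead of text.trim(). If the helper returns an empty string, the parser really did return only the title.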

-- Ken

On Aug 15, 2012, at 12:49pm, Lewis John Mcgibbney wrote:

> Hi,
> 
> For some time (in 2.x) we have had this test commented out while it
> was waiting for TIKA-748 to be resolved. That issue has now been
> resolved, but I'm getting some confusing output when trying to
> resurrect the test!
> 
> So at line 105 we do:
> 
> String text = parse.getText();
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
> 
> But I wanted to implement the suggested test for the title, e.g.
> 
> String title = parse.getTitle();
> String text = parse.getText();
> assertEquals("test rft document", title);
> assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
> 
> The test fails on the 2nd assertion with the following output:
> 
> Testcase: testIt took 5.668 sec
>       FAILED
> null expected:<[The quick brown fox jumps over the lazy dog]> but
> was:<[test rft document]>
> junit.framework.ComparisonFailure: null expected:<[The quick brown fox
> jumps over the lazy dog]> but was:<[test rft document]>
>       at org.apache.nutch.parse.tika.TestRTFParser.testIt(TestRTFParser.java:)
> 
> So this looks like parse.getText() returns the same (in this instance)
> as parse.getTitle()... which smells like rotting herring to me.
> 
> Any immediate thoughts whether this is a known problem in the Tika RTF
> parser, parse-tika's DomContentUtils class or somewhere in between?
> 
> Thank you
> 
> Lewis
> 
> -- 
> Lewis

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
