[ 
https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864392#action_12864392
 ] 

Nick Burch commented on TIKA-405:
---------------------------------

HWPF (the POI component that handles .doc files) is in need of a bit of love - 
the main developer of the component left some time ago, and no new champion has 
taken it up again...

I would suggest opening a new bug against POI, and attaching a few very simple 
.doc files to it, something like:
* 2 paragraphs of text
* 2 paragraphs of text containing some hyperlinks
* 2 paragraphs of text, separated by a simple table

You might then want to look at how the files differ, and at how HWPF sees 
them differing, and then contribute a patch to hwpf.extractor.WordExtractor :)
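
For the "how do they differ" step, a rough sketch of one way to compare the 
text you expect from a test file with the text an extractor actually gives 
you. This is plain Java with no POI calls. TokenDiff and its methods are 
made-up names for illustration, and the sample strings in main() are 
invented, not taken from the attached files:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of POI or Tika. */
class TokenDiff {

    /** Splits text on runs of whitespace; blank input gives an empty list. */
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /**
     * Returns the tokens of expected that have no counterpart in actual,
     * in their original order. Each token in actual can account for only
     * one token in expected, so repeated words are counted properly.
     */
    static List<String> missingTokens(String expected, String actual) {
        List<String> remaining = tokens(actual);
        List<String> missing = new ArrayList<>();
        for (String token : tokens(expected)) {
            // remove(Object) drops the first match and reports whether one existed
            if (!remaining.remove(token)) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Invented stand-ins: a dropped hyperlink and table cells run together
        String expected = "See http://example.org for details. A1 B1 A2 B2";
        String actual = "See for details. A1B1A2B2";
        System.out.println(missingTokens(expected, actual));
    }
}
```

Running this over each of the simple test files should show fairly quickly 
whether the missing tokens line up with the hyperlinks, the table cells, or 
both, which is useful detail to put in the POI bug.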

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>         Attachments: actual.txt, expected.txt, WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test 
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word 
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table 
> filled with dummy text. KeyView generated the full text, as I expected. 
> Aperture and Tika had identical results to one another (barring one lost 
> whitespace character), but their outputs yielded significantly fewer tokens 
> than KeyView's did. I've attached the output text from KeyView and Tika for 
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They 
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a 
> single blob rather than being emitted separately, which ruins any attempt at 
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test 
> file, my guess is that it's a problem with the shared POI library. I thought 
> it would be worth noting, though, in case there's an easy fix on the Tika end 
> of things.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
