[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864392#action_12864392 ]
Nick Burch commented on TIKA-405:
---------------------------------
HWPF (the POI component that handles .doc files) is in need of a bit of love -
the main developer of the component left some time ago, and no new champion has
taken it up again...
I would suggest opening a new bug against POI. To that bug, I'd suggest
attaching a few very simple .doc files, something like:
* 2 paragraphs of text
* 2 paragraphs of text, which contain some hyperlinks
* 2 paragraphs of text, separated by a simple table
You might then want to look at how the files differ, and how HWPF sees them
differing, then contribute a patch to hwpf.extractor.WordExtractor :)
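To see how the extracted text differs between the three files, it can help to
make otherwise invisible control characters visible before diffing. The sketch
below is only an illustration and uses nothing beyond the JDK: ControlCharDump
and visualize are hypothetical names, and the sample string is made up. In
practice the input would be the strings the extractor returns (for example
from WordExtractor.getParagraphText()), which is not shown here. The sample
assumes that table cells in the raw text end with U+0007; please verify that
against your own files.

```java
public class ControlCharDump {
    /**
     * Replaces control characters (other than newline and tab) with a
     * visible tag such as <U+0007>, so they show up in a plain-text diff.
     */
    static String visualize(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < 0x20 && c != '\n' && c != '\t') {
                out.append(String.format("<U+%04X>", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample; real input would be the extractor's output.
        String sample = "cell one\u0007cell two\u0007";
        System.out.println(visualize(sample));
    }
}
```

Running the dump over the "plain", "hyperlinks" and "table" variants should
make it easier to spot which markers only appear in the problem files.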
> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
> Key: TIKA-405
> URL: https://issues.apache.org/jira/browse/TIKA-405
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.7
> Environment: 32-bit Ubuntu Linux
> Reporter: Curtis Warner
> Attachments: actual.txt, expected.txt, WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table
> filled with dummy text. KeyView generated the full text, as I expected.
> Aperture and Tika had identical results to one another (barring one lost
> whitespace character), but their outputs yielded significantly fewer tokens
> than KeyView's did. I've attached the output text from KeyView and Tika for
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a
> single blob rather than being emitted separately, which ruins any attempt at
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test
> file, my guess is that it's a problem with the shared POI library. I thought
> it would be worth noting, though, in case there's an easy fix on the Tika end
> of things.
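The token comparison described in the report can be approximated with a small
helper that lists the whitespace-separated tokens present in the expected text
but absent from the actual text. This is a rough sketch, not the reporter's
actual test harness: TokenDiff, tokens and missingTokens are hypothetical
names, the sample strings are invented, and splitting on whitespace is an
assumption about what "token" means here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TokenDiff {
    /** Splits text on runs of whitespace, keeping first-seen order. */
    static Set<String> tokens(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Returns the tokens found in expected but not in actual. */
    static Set<String> missingTokens(String expected, String actual) {
        Set<String> missing = tokens(expected);
        missing.removeAll(tokens(actual));
        return missing;
    }

    public static void main(String[] args) {
        // Invented sample: two cell values run together in the actual text.
        String expected = "alpha beta gamma delta";
        String actual = "alpha betagamma delta";
        System.out.println(missingTokens(expected, actual));
    }
}
```

Feeding it the contents of expected.txt and actual.txt would give a quick list
of the tokens lost to skipped hyperlinks or merged table cells.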