Dustin Spicuzza created TIKA-2459:
-------------------------------------

             Summary: Missing text in .doc file (but can be extracted by POI)
                 Key: TIKA-2459
                 URL: https://issues.apache.org/jira/browse/TIKA-2459
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
         Environment: Windows and Linux
            Reporter: Dustin Spicuzza
         Attachments: foo2.doc

I've got a document whose text can be extracted via org.apache.poi.hwpf.converter.WordToTextConverter, but does not fully get extracted by Tika. The 'paragraph one' paragraph is present in the POI extraction output, and is not present in Tika's output.

Tika's output:

{noformat}
Something One: Else Two: Here Three: Four Paragraph two Paragraph three Paragraph four cc: Somebody Somebody else Something here too
{noformat}

POI's output:

{noformat}
Something One: Else Two: Here Three: Four Paragraph one Paragraph two Paragraph three Paragraph four cc: Somebody Somebody else Something here too
{noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
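
For anyone reproducing the comparison above, here is a minimal sketch of one way to spot which lines of one extractor's output are absent from the other's. It uses only the JDK and works on two strings that have already been extracted; it does not call POI or Tika. The class name ExtractionDiff and the method missingLines are hypothetical names chosen for illustration, and the sample strings reuse only the four paragraph labels from the report, assuming each paragraph sits on its own line (the actual line breaks in the outputs are not known here).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or POI.
final class ExtractionDiff {

    /**
     * Returns the non-blank lines of {@code reference} that do not occur
     * anywhere in {@code candidate}, in their original order.
     * Lines are compared after trimming surrounding whitespace.
     */
    static List<String> missingLines(String reference, String candidate) {
        Set<String> seen = new HashSet<>();
        // \R matches any line break sequence (\n, \r\n, ...), Java 8+.
        for (String line : candidate.split("\\R")) {
            seen.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !seen.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: one paragraph per line; labels taken from the report.
        String poi = String.join("\n",
                "Paragraph one", "Paragraph two", "Paragraph three", "Paragraph four");
        String tika = String.join("\n",
                "Paragraph two", "Paragraph three", "Paragraph four");
        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

In a real reproduction the two strings would come from the POI and Tika extractors run against foo2.doc. This comparison is set-based, so it would not notice a line that appears twice in one output and once in the other.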