[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682644#comment-13682644 ]
Ray Gauss II commented on TIKA-1130:
------------------------------------

I've created a unit test that reproduces the issue with a stripped down version of the original file.

Shall I comment out the actual test and commit?

> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
>                 Key: TIKA-1130
>                 URL: https://issues.apache.org/jira/browse/TIKA-1130
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.2, 1.3
>         Environment: OpenJDK x86_64
>            Reporter: Daniel Gibby
>            Priority: Critical
>         Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document),
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions
> of text are what are not extracted, while the darker colored text extracts
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is
> all there.
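
The manual check described in the report (opening the .docx as a zip and reading document.xml) can be scripted. The following is a minimal sketch in Python, not part of Tika or of the unit test mentioned above. It assumes the main document part lives at the conventional path word/document.xml and that visible text sits in w:t elements. The helper names docx_text_runs and missing_runs are made up for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_runs(path):
    """Return the text of every w:t element in word/document.xml, in document order."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main part is stored at the conventional location
        root = ET.fromstring(archive.read("word/document.xml"))
    # iter() walks the whole tree, so runs nested inside tables,
    # text boxes, or content controls are included as well
    return [node.text for node in root.iter(f"{{{W_NS}}}t") if node.text]


def _squash(text):
    """Collapse whitespace so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def missing_runs(extracted_text, path):
    """List the raw text runs that do not appear in the extracted output."""
    haystack = _squash(extracted_text)
    return [
        run for run in docx_text_runs(path)
        if _squash(run) and _squash(run) not in haystack
    ]
```

Passing the text produced by the extractor (however it was obtained) as extracted_text together with the path to the attached file should list the runs that were dropped. Word may split a sentence across several runs, so the result is a list of fragments, not whole sentences.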