[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162745#comment-13162745 ]
Fabian Lange commented on TIKA-423:
-----------------------------------

Yes, it's the same reason. My proposed POI patch fixes both.

> Parse docx and output to text file missing words
> ------------------------------------------------
>
>                 Key: TIKA-423
>                 URL: https://issues.apache.org/jira/browse/TIKA-423
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7, 0.8, 0.9, 0.10
>         Environment: Windows and Mac
>            Reporter: David Tran
>              Labels: docx, missing_word, smart_tag, word
>         Attachments: output.txt, tika_test.docx
>
>
> I created a Word document containing person names, country names,
> addresses, and a date, using Word 2007 on a Windows Server 2003 machine
> (via Remote Desktop); the same has happened to someone else using
> Windows XP. Some of these elements are tagged as "Smart Tags" by Word, and
> some words disappear from Tika's parsed output.
> So a text fragment like the one below in Word:
> Smart tags typically are names like George Washington, Marilyn Monroe,
> Napoleon Bonaparte, etc. But they are automatically generated by Word, so it
> can be difficult to control how they are
> After running Tika from the command line (on OSX),
> java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt
> will result in something like:
> Smart tags typically are names like , , Napoleon Bonaparte, etc. But they
> are automatically generated by Word, so it can be difficult to control how
> they are
> Note the missing names George Washington and Marilyn Monroe; Marilyn Monroe
> was one that was tagged by Word.
> While I've only tried this with Tika 0.7, my understanding is that it has
> been an issue since 0.3 at least.
> Removing all Smart Tags from the document using the AutoCorrect options in
> Word will result in the correct output.
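
For anyone following along, here is a minimal sketch of how this kind of text loss can arise. It assumes the cause is that Word wraps tagged text runs (w:r) inside a w:smartTag element, so an extractor that only visits runs that are direct children of the paragraph skips them. That assumption is not confirmed in this thread, and the code below is not the POI patch or the Tika parser: the SmartTagDemo class, its method names, and the SAMPLE paragraph are hypothetical, and it uses only the JDK's DOM API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SmartTagDemo {
    // WordprocessingML main namespace.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, hand-written paragraph: the middle run is wrapped in w:smartTag.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W + "\">"
      + "<w:r><w:t xml:space=\"preserve\">names like </w:t></w:r>"
      + "<w:smartTag w:uri=\"urn:schemas-microsoft-com:office:smarttags\" w:element=\"PersonName\">"
      + "<w:r><w:t>George Washington</w:t></w:r>"
      + "</w:smartTag>"
      + "<w:r><w:t>, etc.</w:t></w:r>"
      + "</w:p>";

    // Parse the XML string with a namespace-aware DOM parser and return the root element.
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Naive extraction: only w:r elements that are direct children of the paragraph.
    static String directRunsText(String xml) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parse(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && W.equals(n.getNamespaceURI())
                    && "r".equals(n.getLocalName())) {
                sb.append(n.getTextContent());
            }
        }
        return sb.toString();
    }

    // Extraction that also descends into wrappers: every w:t at any depth.
    static String allRunsText(String xml) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = parse(xml).getElementsByTagNameNS(W, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(directRunsText(SAMPLE)); // prints "names like , etc."
        System.out.println(allRunsText(SAMPLE));    // prints "names like George Washington, etc."
    }
}
```

The first line of output shows the same symptom as the report (a name missing between "like" and the comma), while the second recovers it by descending into the wrapper element.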