[ 
https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Tran updated TIKA-423:
----------------------------

    Attachment: tika_test.docx
                output.txt

Test docx file and the output produced by Tika 0.7.

> Parse docx and output to text file missing words
> ------------------------------------------------
>
>                 Key: TIKA-423
>                 URL: https://issues.apache.org/jira/browse/TIKA-423
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows and Mac
>            Reporter: David Tran
>         Attachments: output.txt, tika_test.docx
>
>
> I created a Word document using Word 2007 on a Windows Server 2003 machine 
> (via Remote Desktop) containing person names, country names, addresses, and 
> a date. The same problem has also happened to someone else using Windows XP. 
> Word tags some of these elements as "Smart Tags", and some words disappear 
> from the text output that Tika produces when it parses the document.
> So a text fragment like the one below in Word:
> Smart tags typically are names like George Washington, Marilyn Monroe, 
> Napoleon Bonaparte, etc. But they are automatically generated by Word, so it 
> can be difficult to control how they are 
> Running Tika from the command line (on OS X):
> java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt
> results in something like:
> Smart tags typically are names like  , , Napoleon Bonaparte, etc. But they 
> are automatically generated by Word, so it can be difficult to control how 
> they are
> Note the missing names George Washington and Marilyn Monroe. Marilyn Monroe 
> was one of the names that Word tagged.
> While I've only tried this with Tika 0.7, my understanding is that it has 
> been an issue since 0.3 at least.
> Removing all Smart Tags from the document using the AutoCorrect options in 
> Word results in the correct output.
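
To check which words in a given docx are actually smart-tagged, and compare them against the words missing from Tika's output, something like the following could be used. This is only a rough sketch and not part of Tika: the class name SmartTagLister is hypothetical, and it assumes the file is a standard OOXML package whose main text lives in word/document.xml, with smart tags stored as w:smartTag elements in the WordprocessingML namespace.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Tika): lists the text wrapped in smart tags.
class SmartTagLister {
    // WordprocessingML main namespace used by Word 2007 documents.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Returns the text content of every w:smartTag element in the XML.
     * Nested smart tags are reported once for the outer element and
     * again for each inner element.
     */
    public static List<String> smartTagTexts(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on the w: namespace
        Document doc = factory.newDocumentBuilder().parse(xml);
        NodeList tags = doc.getElementsByTagNameNS(W_NS, "smartTag");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < tags.getLength(); i++) {
            texts.add(tags.item(i).getTextContent());
        }
        return texts;
    }

    // Usage: java SmartTagLister /path/to/docx/document.docx
    public static void main(String[] args) throws Exception {
        try (ZipFile docx = new ZipFile(args[0])) {
            // Assumption: the main document part is at the conventional path.
            ZipEntry entry = docx.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("word/document.xml not found");
                return;
            }
            try (InputStream in = docx.getInputStream(entry)) {
                for (String text : smartTagTexts(in)) {
                    System.out.println(text);
                }
            }
        }
    }
}
```

Running it against the attached tika_test.docx should print each smart-tagged phrase on its own line. If those phrases match the words missing from output.txt, that would support the Smart Tag explanation above.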

