[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Tran updated TIKA-423:
----------------------------

    Description: 
I created a Word document containing person names, country names, addresses, 
and a date, using Word 2007 on a Windows Server 2003 machine (over Remote 
Desktop). The same thing has also happened to someone else using Windows XP. 
Word tags some of these elements as "Smart Tags", and some words disappear 
from the output when Tika parses the document.

So a text fragment like the one below in Word:
Smart tags typically are names like George Washington, Marilyn Monroe, Napoleon 
Bonaparte, etc. But they are automatically generated by Word, so it can be 
difficult to control how they are 

Running Tika from the command line (on OS X):
java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt
results in something like:
Smart tags typically are names like  , , Napoleon Bonaparte, etc. But they are 
automatically generated by Word, so it can be difficult to control how they are

Note that the names George Washington and Marilyn Monroe are missing. Marilyn 
Monroe was one of the names that Word tagged.
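
One way to confirm which names Word actually wrapped in smart tags is to look 
inside the .docx itself, which is a ZIP archive whose main text lives in 
word/document.xml. The following is only a rough sketch in Python and is not 
part of Tika. The function names are made up for illustration, and it assumes 
the smart tags appear as w:smartTag elements in the WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml in .docx files.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def smart_tagged_text(document_xml):
    """Return the text wrapped by each w:smartTag element, in document order.

    Hypothetical helper. Nested smart tags are each reported, so the text of
    an inner tag also appears inside the text of its enclosing tag.
    """
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in tag.iter(W + "t"))
        for tag in root.iter(W + "smartTag")
    ]


def smart_tagged_text_in_docx(path):
    """Read word/document.xml from a .docx and list its smart-tagged text."""
    with zipfile.ZipFile(path) as docx:
        return smart_tagged_text(docx.read("word/document.xml"))
```

Calling smart_tagged_text_in_docx("/path/to/docx/document.docx") and comparing 
the result with the words missing from output.txt should show whether the 
dropped words are exactly the smart-tagged ones.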

While I've only tried this with Tika 0.7, my understanding is that it has been 
an issue since 0.3 at least.

Removing all Smart tags from the document using Autocorrect options in Word 
will result in the correct output.
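
For documents that cannot be re-saved from Word, a similar effect might be 
achieved by unwrapping the smart tags directly in the .docx. This is an 
untested-against-Tika sketch in Python with hypothetical function names. It 
assumes the usual "w:" namespace prefix, UTF-8 encoded XML, and that only 
word/document.xml needs changing (headers and footers are left alone).

```python
import re
import zipfile

# Smart tag property blocks (<w:smartTagPr>...</w:smartTagPr>) carry no text
# and are dropped entirely.
_SMART_TAG_PR = re.compile(
    r"<w:smartTagPr\b[^>]*?(?:/>|>.*?</w:smartTagPr>)", re.DOTALL
)
# Opening and closing <w:smartTag> wrappers; their child runs are kept.
_SMART_TAG = re.compile(r"</?w:smartTag(?:\s[^>]*)?>")


def unwrap_smart_tags(document_xml):
    """Remove w:smartTag wrappers from the XML text, keeping the runs inside."""
    without_props = _SMART_TAG_PR.sub("", document_xml)
    return _SMART_TAG.sub("", without_props)


def strip_smart_tags(src_path, dst_path):
    """Copy a .docx, rewriting word/document.xml without smart tags."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = unwrap_smart_tags(data.decode("utf-8")).encode("utf-8")
            dst.writestr(item, data)
```

The copy written to dst_path can then be passed to the same tika-app command 
to see whether the missing names come back.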

  was:
I created a Word document containing person names, country names, addresses, 
and a date, using Word 2007 on a Windows Server 2003 machine (over Remote 
Desktop). The same thing has also happened to someone else using Windows XP. 
Word tags some of these elements as "Smart Tags", and some words disappear 
from the output when Tika parses the document.

So a text fragment like the one below in Word:
Smart tags typically are names like George Washington, Marilyn Monroe, Napoleon 
Bonaparte, etc. But they are automatically generated by Word, so it can be 
difficult to control how they are 

Running Tika from the command line (on OS X):
java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt
results in something like:
Smart tags typically are names like  , , Napoleon Bonaparte, etc. But they are 
automatically generated by Word, so it can be difficult to control how they are

Note that the names George Washington and Marilyn Monroe are missing. Marilyn 
Monroe was one of the names that Word tagged.

While I've only tried this with Tika 0.7, my understanding is that it has been 
an issue since 0.3 at least.

Removing all Smart tags from the document using Autocorrect options in Word 
will result in the correct output. I can attach a sample document and output 
text file if that will help.


> Parse docx and output to text file missing words
> ------------------------------------------------
>
>                 Key: TIKA-423
>                 URL: https://issues.apache.org/jira/browse/TIKA-423
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows and Mac
>            Reporter: David Tran

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
