[ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-207.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

Thanks, Curt! Patch committed in revision 1164471.

> MS word doc containing tracked changes produces incorrect text
> --------------------------------------------------------------
>
>                 Key: TIKA-207
>                 URL: https://issues.apache.org/jira/browse/TIKA-207
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>         Environment: tika-0.3-standalone.jar
>            Reporter: Michael McCandless
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: TIKA-207.patch, TIKA-207.patch
>
>
> Spinoff from this discussion:
>
>   http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html
> When extracting text from an MS Word doc (2003 format) that has
> unapproved pending changes, the text from both old and new is glommed
> together.
> EG I had a doc that contained text "Field.Index.TOKENIZED", and I
> changed TOKENIZED to ANALYZED with track changes enabled, and
> then when I extract text (using TikaCLI) it produces this:
>   Field.Index.TOKENIZEDANALYZED
> So, first, it'd be nice to at least get whitespace inserted between
> old & new text.
> And, second, it'd be great to have an option to control whether it's
> old or new text that's indexed (or at least an option to only see
> "new" text, ie the current document).
> From the discussion above, it seems like POI may expose the
> fine-grained APIs to allow Tika to do this; it's just that Tika's not
> leveraging these APIs for MS Word docs.

-- 
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
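
For readers following the report, the two requests in it (whitespace between old and new text, and a choice of which revision to extract) can be illustrated with a small language-neutral sketch. This is not the committed patch and not Tika's or POI's API: the `Run`, `Change`, and `extract_text` names and the `view` parameter are hypothetical, and the sketch assumes the document has already been split into text runs that each carry a tracked-change flag.

```python
from dataclasses import dataclass
from enum import Enum


class Change(Enum):
    """Hypothetical tracked-change state of a run of text."""
    NONE = "none"          # text not touched by any pending change
    INSERTED = "inserted"  # pending insertion, part of the "new" text
    DELETED = "deleted"    # pending deletion, part of the "old" text


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for one run of characters in the document."""
    text: str
    change: Change = Change.NONE


def extract_text(runs, view="new"):
    """Join run texts according to the requested view.

    view="new"  keeps unchanged and inserted runs (the current document),
    view="old"  keeps unchanged and deleted runs (the original document),
    view="both" keeps everything, with a space where a deleted run
                directly meets an inserted run (or the reverse).
    """
    if view not in ("new", "old", "both"):
        raise ValueError("view must be 'new', 'old' or 'both'")

    parts = []
    prev_change = Change.NONE
    for run in runs:
        if view == "new" and run.change is Change.DELETED:
            continue
        if view == "old" and run.change is Change.INSERTED:
            continue
        # Only separate old text from new text; unchanged text on either
        # side is left exactly as it appears in the document.
        meets_other_revision = (
            run.change is not prev_change
            and Change.NONE not in (run.change, prev_change)
        )
        if view == "both" and parts and meets_other_revision:
            parts.append(" ")
        parts.append(run.text)
        prev_change = run.change
    return "".join(parts)


# The example from the report, expressed as hypothetical runs.
runs = [
    Run("Field.Index."),
    Run("TOKENIZED", Change.DELETED),
    Run("ANALYZED", Change.INSERTED),
]
print(extract_text(runs, "new"))
print(extract_text(runs, "old"))
print(extract_text(runs, "both"))
```

With the runs above, the "new" view yields Field.Index.ANALYZED, the "old" view yields Field.Index.TOKENIZED, and the "both" view yields the two variants separated by a space rather than glommed together.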