[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: testWORD_bold_character_runs2.docx
                testWORD_bold_character_runs.docx
                TIKA-683.patch

New patch attached, including the last (pretty-print) patch, plus I noticed that the OOXML Word parser also split up adjacent bold character runs, so I fixed that and added 2 docx files for testing.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf, testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>    http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters