[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066026#comment-13066026 ]
Nick Burch commented on TIKA-683:
---------------------------------

I couldn't use the test as-is, as it contains raw Japanese characters in an unknown encoding (rather than \uxxxx escape sequences), and the sample file was too large.

I've re-saved the sample file without the images, and tested with that. That extracts exactly as expected - no doubling up occurs. I've added a unit test for this in r1147200.

Are you able to get a small RTF file that shows the problem, along with a suitable unit test similar to the testJapaneseText() method in RTFParser?

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf
>
>
> As reported on user@ in "non-West European languages support":
>    http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters
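
For the unit test requested in the comment, a minimal sketch of the assertion side might look like the following. It is only an illustration, not the test added in r1147200: the class name, the helper, and the sample string are all hypothetical, and the extracted text is faked with plain strings rather than produced by the parser. The point is that the expected text is written with Unicode escapes, so the test source stays pure ASCII and does not depend on the file's encoding, and that counting occurrences (rather than only checking for presence) catches doubling.

```java
public class DoublingCheck {
    // Hypothetical sample text (the word for "Japanese language"), written
    // with Unicode escapes so this source file contains only ASCII bytes.
    static final String EXPECTED = "\u65e5\u672c\u8a9e";

    // Counts non-overlapping occurrences of needle in text.
    static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-ins for extracted text; a real test would take this
        // from the parser's output instead (assumption).
        String good = "Title: " + EXPECTED + "\n";
        String doubled = "Title: " + EXPECTED + EXPECTED + "\n";

        System.out.println(countOccurrences(good, EXPECTED));    // prints 1
        System.out.println(countOccurrences(doubled, EXPECTED)); // prints 2
    }
}
```

Asserting that the count equals exactly 1 fails both when the whole phrase is repeated (count 2) and when individual characters are repeated inside it (count 0, since the phrase no longer appears intact).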