[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

Michael McCandless (JIRA) Mon, 22 Aug 2011 15:14:53 -0700

     [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated TIKA-683:
------------------------------------

    Attachment: testWORD_bold_character_runs2.docx
                testWORD_bold_character_runs.docx
                TIKA-683.patch

New patch attached. It includes the last (pretty-print) patch. I also noticed 
that the OOXML Word parser was splitting up adjacent bold character runs, so I 
fixed that and added 2 docx files for testing.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
