[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Cristian Vat (JIRA) Thu, 18 Aug 2011 16:03:56 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087367#comment-13087367
 ]


Cristian Vat commented on TIKA-683:
-----------------------------------

Thanks Mike for looking into the issues. I also know very little about RTF :)

Yes, the skipping basically comes down to "skip N ANSI chars".
Actually the JDK RTFEditorKit/RTFReader already does this, and does it well as 
far as I could see.
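
To make the skipping concrete, here is a rough sketch of how I understand it 
from the RTF spec: \ucN sets how many fallback ANSI chars follow each \uN, and 
the reader drops that many after emitting the Unicode char. This is not the 
Tika or JDK code. The class and method names are made up for illustration, and 
it assumes a flat fragment with no groups and no other control words.

```java
final class UcSkipSketch {

    /**
     * Decodes a flat RTF fragment that contains only literal text, hex
     * escapes, the "uc" control word and the "u" control word.
     * Groups and all other control words are ignored in this sketch.
     */
    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int ucSkip = 1;       // RTF default: one fallback char per Unicode char
        int pendingSkip = 0;  // fallback chars that still have to be dropped
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 3 < rtf.length() && rtf.charAt(i + 1) == '\'') {
                // A hex escape counts as ONE fallback character.
                int value = Integer.parseInt(rtf.substring(i + 2, i + 4), 16);
                i += 4;
                if (pendingSkip > 0) pendingSkip--;
                else out.append((char) value); // naive Latin-1 mapping (assumption)
            } else if (c == '\\') {
                // Control word: lowercase letters, then an optional signed number.
                int j = i + 1;
                while (j < rtf.length() && isLower(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                int k = j;
                if (k + 1 < rtf.length() && rtf.charAt(k) == '-'
                        && isDigit(rtf.charAt(k + 1))) k++;
                while (k < rtf.length() && isDigit(rtf.charAt(k))) k++;
                boolean hasParam = k > j;
                int param = hasParam ? Integer.parseInt(rtf.substring(j, k)) : 0;
                // A single space after a control word is a delimiter, not text.
                if (k < rtf.length() && rtf.charAt(k) == ' ') k++;
                i = k;
                if (word.equals("uc") && hasParam) {
                    ucSkip = Math.max(0, param);
                } else if (word.equals("u") && hasParam) {
                    // The parameter is a signed 16-bit value; mask to get the char.
                    out.append((char) (param & 0xFFFF));
                    pendingSkip = ucSkip;
                }
            } else {
                // A literal character also counts as one fallback character.
                i++;
                if (pendingSkip > 0) pendingSkip--;
                else out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isLower(char c) { return c >= 'a' && c <= 'z'; }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
}
```

If the fallback chars are not dropped, each character shows up twice (once from 
\uN and once from its fallback), which looks like the doubling reported here.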

There are also other flaws in the filtering we currently do. For example, 
skipping of binary data sequences is not handled correctly...
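
As an illustration of what I mean by binary skipping: per the RTF spec, \binN 
is followed by exactly N raw bytes that must be skipped without looking at 
them, since they can contain backslashes and braces. A minimal sketch, with a 
made-up class name and working on the raw bytes (this is not what the current 
filter does):

```java
import java.io.ByteArrayOutputStream;

final class BinSkipSketch {

    /** Returns a copy of the input with every "bin" control word and its payload removed. */
    static byte[] stripBin(byte[] rtf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtf.length) {
            if (isEscapedSymbol(rtf, i)) {
                // Escaped backslash or brace: copy both bytes untouched.
                out.write(rtf[i]);
                out.write(rtf[i + 1]);
                i += 2;
            } else if (startsWithBin(rtf, i)) {
                int j = i + 4; // position of the first digit of N
                long n = 0;
                while (j < rtf.length && rtf[j] >= '0' && rtf[j] <= '9') {
                    if (n <= rtf.length) n = n * 10 + (rtf[j] - '0'); // cap to avoid overflow
                    j++;
                }
                if (j < rtf.length && rtf[j] == ' ') j++; // delimiter space
                // Jump over the N payload bytes without interpreting them.
                i = (int) Math.min((long) rtf.length, j + n);
            } else {
                out.write(rtf[i++]);
            }
        }
        return out.toByteArray();
    }

    private static boolean isEscapedSymbol(byte[] b, int i) {
        return b[i] == '\\' && i + 1 < b.length
                && (b[i + 1] == '\\' || b[i + 1] == '{' || b[i + 1] == '}');
    }

    private static boolean startsWithBin(byte[] b, int i) {
        return i + 4 < b.length && b[i] == '\\' && b[i + 1] == 'b'
                && b[i + 2] == 'i' && b[i + 3] == 'n'
                && b[i + 4] >= '0' && b[i + 4] <= '9';
    }
}
```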

I went through all the classes in, or used by, RTFEditorKit, and it appears to 
handle most things correctly except the "\'xx" escape, where it uses a default 
translation table that does not take the current font's charset into account.
Right now I'm trying to figure out whether I can add that behavior by 
subclassing RTFEditorKit/RTFReader. I think that would be the best solution to 
this issue and to other related ones. It might also avoid temporary files and 
improve performance.
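
Roughly what I have in mind for "\'xx", independent of how it ends up being 
wired into RTFReader: collect consecutive hex escapes as bytes and decode them 
with the charset of the current font (\fcharsetN in the font table). The sketch 
below is only an assumption about how that could look. The class and method 
names are hypothetical, the \fcharsetN mapping is partial, and it assumes the 
JVM ships the named charsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class HexEscapeCharsetSketch {

    /** Partial mapping from the fcharset number of a font to a Java charset. */
    static Charset charsetFor(int fcharset) {
        switch (fcharset) {
            case 128: return Charset.forName("Shift_JIS");    // Japanese
            case 134: return Charset.forName("GBK");          // Simplified Chinese
            case 136: return Charset.forName("Big5");         // Traditional Chinese
            case 204: return Charset.forName("windows-1251"); // Cyrillic
            case 238: return Charset.forName("windows-1250"); // Central European
            default:  return Charset.forName("windows-1252"); // ANSI fallback
        }
    }

    /**
     * Decodes the hex escapes in a fragment using the charset of the current
     * font. Runs of escapes are decoded together so that multi-byte
     * characters survive. Literal text between escapes is copied as-is, which
     * is a simplification.
     */
    static String decode(String fragment, int fcharset) {
        Charset cs = charsetFor(fcharset);
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < fragment.length()) {
            if (fragment.startsWith("\\'", i) && i + 4 <= fragment.length()) {
                pending.write(Integer.parseInt(fragment.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, cs, out);
                out.append(fragment.charAt(i++));
            }
        }
        flush(pending, cs, out);
        return out.toString();
    }

    private static void flush(ByteArrayOutputStream pending, Charset cs, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }
}
```

With a single default table, the two bytes \'93\'fa in a Japanese font come out 
as two unrelated characters; decoded together as Shift_JIS they are one 
character.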

Just in case it can't be done with subclassing, does anybody know what the 
licensing restrictions on the JDK classes are (mainly RTFEditorKit and 
RTFReader)? It may be doable by modifying them a little...

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
