RTF parser incorrectly applies fonts to complete group
------------------------------------------------------
Key: TIKA-777
URL: https://issues.apache.org/jira/browse/TIKA-777
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.0
Reporter: Arjohn Kampman
Tika's RTF parser processes the following RTF fragment incorrectly, applying
the wrong character encoding to the parsed characters:
{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}
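This document contains Russian characters in font \f1 (\fcharset204), but Tika
decodes them as Latin because of the \f0 directive at the end of the group.
The escaped bytes are apparently still buffered when \f0 switches the font, so
they end up decoded with the charset of the wrong font. The RTF parser should
probably flush its pendingBytes buffer before processing directives such as
these.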
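
To illustrate the suggested fix, here is a minimal, self-contained sketch. It
is not Tika's actual parser code: the class PendingBytesSketch and its methods
(hexByte, setFont, flush, endGroup, decode) are hypothetical names invented
for this example, and it assumes that \fcharset0 maps to windows-1252 and
\fcharset204 to windows-1251. It only models the first word of the fragment
and contrasts decoding with and without a flush on font change.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical model of a parser that buffers \'hh bytes before decoding them.
final class PendingBytesSketch {
    private final ByteArrayOutputStream pendingBytes = new ByteArrayOutputStream();
    private final StringBuilder out = new StringBuilder();
    private final boolean flushOnFontChange;
    private Charset current;

    PendingBytesSketch(Charset initial, boolean flushOnFontChange) {
        this.current = initial;
        this.flushOnFontChange = flushOnFontChange;
    }

    // Called for each \'hh escape: the raw byte is buffered, not decoded yet.
    void hexByte(int b) {
        pendingBytes.write(b);
    }

    // Called for \fN: switches the charset used for subsequent decoding.
    void setFont(Charset charset) {
        if (flushOnFontChange) {
            flush(); // decode buffered bytes with the charset they were written in
        }
        current = charset;
    }

    // Decodes whatever is buffered using the currently active charset.
    void flush() {
        if (pendingBytes.size() > 0) {
            out.append(new String(pendingBytes.toByteArray(), current));
            pendingBytes.reset();
        }
    }

    // Called at the closing brace of the group.
    String endGroup() {
        flush();
        return out.toString();
    }

    // Replays "{\f1 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9\f0}" from the report.
    static String decode(boolean flushOnFontChange) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed for \fcharset0
        Charset cp1251 = Charset.forName("windows-1251"); // assumed for \fcharset204
        PendingBytesSketch p = new PendingBytesSketch(cp1252, flushOnFontChange);
        p.setFont(cp1251); // \f1
        for (int b : new int[] {0xd3, 0xe2, 0xe0, 0xe6, 0xe0, 0xe5, 0xec, 0xfb, 0xe9}) {
            p.hexByte(b);
        }
        p.setFont(cp1252); // \f0 at the end of the group
        return p.endGroup();
    }

    public static void main(String[] args) {
        System.out.println("without flush: " + decode(false));
        System.out.println("with flush:    " + decode(true));
    }
}
```

Under these assumptions, decode(false) reproduces the reported behaviour (the
bytes are decoded as Latin, giving "Óâàæàåìûé"), while decode(true) yields the
intended Russian word "Уважаемый". Whether flushing on every font directive is
the right place for the fix in the real parser is left open here.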