https://bz.apache.org/ooo/show_bug.cgi?id=128549

--- Comment #5 from dam...@apache.org ---
Our lower-level RTF parser is in main/svtools/source/svrtf/parrtf.cxx:
SvRTFParser::_GetNextToken() calls SvRTFParser::ScanText(), which parses
"\'8e" by treating the two hex digits as a single byte value. Other permissively
licensed open-source projects such as rtf.js do the same
(https://github.com/tbluemel/rtf.js/blob/master/src/rtfjs/parser/Parser.ts#L422).
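
For illustration, here is a minimal standalone sketch of that interpretation
(not the actual ScanText() code, names are made up): the two hex digits after
\' are combined into one byte, which is later mapped through whatever
character set is currently in effect.

---snip---
#include <cctype>
#include <cstdio>
#include <string>

// Hedged sketch: turn an RTF \'hh escape into one byte, or -1 on bad input.
static int ParseHexEscape( const std::string& rTok )
{
    // Expect a token of the form "\'hh", e.g. "\\'8e".
    if( rTok.size() < 4 || rTok[0] != '\\' || rTok[1] != '\'' )
        return -1;
    auto hexVal = []( char c ) -> int {
        if( std::isdigit( (unsigned char)c ) ) return c - '0';
        c = (char)std::tolower( (unsigned char)c );
        return ( c >= 'a' && c <= 'f' ) ? c - 'a' + 10 : -1;
    };
    int hi = hexVal( rTok[2] ), lo = hexVal( rTok[3] );
    return ( hi < 0 || lo < 0 ) ? -1 : ( hi << 4 ) | lo;
}

int main()
{
    std::printf( "0x%02X\n", ParseHexEscape( "\\'8e" ) );  // prints 0x8E
    return 0;
}
---snip---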

And the RTF 1.0 spec from https://latex2rtf.sourceforge.net/RTF-Spec-1.0.txt
confirms it:

---snip---
  \'hh            A hexadecimal value, based on the specified
                  character set (may be used to identify 8-bit
                  values).
---snip---

So "\'8e" becomes the byte 0x8e, but then how does that become "é"?

What is this "specified character set"?

The file begins with:

00000000  7b 5c 72 74 66 30 5c 6d  61 63 20 0d 7b 5c 63 6f  |{\rtf0\mac .{\co|

and the RTF spec says under the "THE CHARACTER SET" section:

---snip---
    \mac           Apple Macintosh
---snip---

The "\mac" should be parsed in SvRTFParser::Continue() where we have:

---snip---
    case RTF_MACTYPE:
        SetEncoding( eCodeSet = RTL_TEXTENCODING_APPLE_ROMAN );
        break;
---snip---

as main/svtools/inc/svtools/rtfkeywd.hxx has:

---snip---
#define OOO_STRING_SVTOOLS_RTF_MAC "\\mac"
---snip---

Our character set conversions are generally done under main/sal/textenc, and in
main/sal/textenc/tcvtlat1.tab we have:

---snip---
static sal_uInt16 const aImplAPPLEROMANToUniTab[APPLEROMANUNI_END -
APPLEROMANUNI_START + 1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1, /* 0x80 */
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8, /* 0x80 */
---snip---

which would translate 0x8E into U+00E9 ("é"), the expected character.
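
As a worked example of that indexing (assuming APPLEROMANUNI_START is 0x80, as
the row comments suggest), a tiny standalone sketch using just the two rows
quoted above:

---snip---
#include <cstdio>

// Hedged sketch of the table lookup; only the two rows quoted above are
// reproduced here, and APPLEROMANUNI_START is assumed to be 0x80.
static const unsigned short aAppleRomanRows[16] = {
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1, /* 0x80..0x87 */
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8, /* 0x88..0x8F */
};

int main()
{
    const unsigned char nByte = 0x8E;                   // from "\'8e"
    unsigned short nUni = aAppleRomanRows[nByte - 0x80];
    std::printf( "0x%02X -> U+%04X\n", nByte, nUni );   // prints 0x8E -> U+00E9
    return 0;
}
---snip---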

But we got "Ž" (U+017D) instead in this sample document. Searching that file
for "17D" we see it comes from the table for the MS 1252 encoding:

---snip---
static sal_uInt16 const aImplMS1252ToUniTab[MS1252UNI_END - MS1252UNI_START +
1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x20AC,      0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 0x80 */
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152,      0, 0x017D,      0, /* 0x80 */
---snip---

and indeed, if we look at the constructor for SvRTFParser, we see that's the
encoding it initially sets:

---snip---
SvRTFParser::SvRTFParser( SvStream& rIn, sal_uInt8 nStackSize )
    : SvParser( rIn, nStackSize ),
    eUNICodeSet( RTL_TEXTENCODING_MS_1252 ),    // default is the ANSI code set
    nUCharOverread( 1 )
{
    // default is the ANSI code set
    SetSrcEncoding( RTL_TEXTENCODING_MS_1252 );
    bRTF_InTextRead = false;
}
---snip---
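
For completeness, the two conversions can be reproduced outside the parser with
the rtl string API; a sketch (not compiled against the tree), where converting
0x8E as Apple Roman should give U+00E9 and as MS-1252 should give U+017D:

---snip---
#include <rtl/ustring.hxx>
#include <rtl/textenc.h>
#include <cstdio>

int main()
{
    const char aByte[] = { char(0x8E) };    // the byte from "\'8e"

    // One conversion per candidate encoding; OUString does the table lookup.
    rtl::OUString aMac ( aByte, 1, RTL_TEXTENCODING_APPLE_ROMAN );
    rtl::OUString aAnsi( aByte, 1, RTL_TEXTENCODING_MS_1252 );

    std::printf( "Apple Roman: U+%04X\n", (unsigned)aMac.getStr()[0] );   // U+00E9
    std::printf( "MS-1252:     U+%04X\n", (unsigned)aAnsi.getStr()[0] );  // U+017D
    return 0;
}
---snip---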

But SvRTFParser::Continue() must be getting called after the constructor, and
it seems to set the "mac" encoding, so why is the wrong encoding still used?
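
One way to narrow that down would be a temporary trace right where the \'hh
byte is converted, to see which encoding is actually in effect at that moment.
A rough sketch (the GetSrcEncoding() accessor is an assumption; the real getter
in SvParser may be named differently):

---snip---
#include <cstdio>

// Hypothetical instrumentation for SvRTFParser::ScanText(), placed where the
// \'hh escape is turned into a character; not actual tree code, just the idea.
static void TraceHexEscape( unsigned char nByte, int eSrcEnc )
{
    // Compare the printed value against RTL_TEXTENCODING_APPLE_ROMAN and
    // RTL_TEXTENCODING_MS_1252 to see whether \mac has taken effect yet.
    std::fprintf( stderr, "\\'%02x converted with encoding %d\n", nByte, eSrcEnc );
}

// Hypothetical call site inside ScanText():
//     TraceHexEscape( nHexByte, (int)GetSrcEncoding() );
---snip---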
