https://bz.apache.org/ooo/show_bug.cgi?id=128549
--- Comment #5 from dam...@apache.org ---

Our lower level RTF parser is in main/svtools/source/svrtf/parrtf.cxx, and SvRTFParser::_GetNextToken() calls SvRTFParser::ScanText(), which parses the "\'8e" by treating it as 1 byte, in hexadecimal encoding. Other permissively licensed open-source projects like rtf.js do the same (https://github.com/tbluemel/rtf.js/blob/master/src/rtfjs/parser/Parser.ts#L422). And the RTF 1.0 spec from https://latex2rtf.sourceforge.net/RTF-Spec-1.0.txt confirms it:

---snip---
\'hh    A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
---snip---

So "\'8e" becomes the byte 0x8e, but then how does that become "é"? What is this "specified character set"?

The file begins with:

00000000  7b 5c 72 74 66 30 5c 6d  61 63 20 0d 7b 5c 63 6f  |{\rtf0\mac .{\co|

and the RTF spec says under the "THE CHARACTER SET" section:

---snip---
\mac    Apple Macintosh
---snip---

The "\mac" should be parsed in SvRTFParser::Continue(), where we have:

---snip---
        case RTF_MACTYPE:
            SetEncoding( eCodeSet = RTL_TEXTENCODING_APPLE_ROMAN );
            break;
---snip---

as svtools/inc/svtools/rtfkeywd.hxx has:

---snip---
#define OOO_STRING_SVTOOLS_RTF_MAC "\\mac"
---snip---

Our character set conversions are generally done under main/sal/textenc, and in main/sal/textenc/tcvtlab1.tab we have:

---snip---
static sal_uInt16 const aImplAPPLEROMANToUniTab[APPLEROMANUNI_END - APPLEROMANUNI_START + 1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1, /* 0x80 */
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8, /* 0x80 */
---snip---

which would translate 0x8E into Unicode 0x00E9, which is "é" (U+00E9), the expected character. But we got "Ž" (U+017D) instead in this sample document.
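As a rough illustration of the expected path (hex escape to byte, then byte to Unicode via the Apple Roman table), here is a minimal standalone sketch. It is not the SvRTFParser code: parseHexEscape() and appleRomanToUnicode() are hypothetical names made up for this example, and only the 0x80..0x8F slice of the table quoted above is reproduced.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of SvRTFParser): value of one hex digit,
// or -1 if the character is not a hex digit.
static int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decode an RTF "\'hh" escape at the start of `s`
// into the single byte it denotes. Returns -1 if `s` is not such an escape.
int parseHexEscape(const std::string& s)
{
    if (s.size() < 4 || s[0] != '\\' || s[1] != '\'')
        return -1;
    const int hi = hexDigitValue(s[2]);
    const int lo = hexDigitValue(s[3]);
    if (hi < 0 || lo < 0)
        return -1;
    return hi * 16 + lo;
}

// Entries 0x80..0x8F of aImplAPPLEROMANToUniTab, copied from the snippet
// quoted in the comment. The rest of the table is deliberately omitted.
static const std::uint16_t kAppleRomanRow80[16] = {
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1,
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8
};

// Hypothetical lookup restricted to the slice reproduced above.
std::uint16_t appleRomanToUnicode(std::uint8_t b)
{
    assert(b >= 0x80 && b <= 0x8F); // only this range is covered by the sketch
    return kAppleRomanRow80[b - 0x80];
}
```

With these helpers, parseHexEscape("\\'8e") gives 0x8E, and appleRomanToUnicode(0x8E) gives 0x00E9, matching the expected "é".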
Searching main/sal/textenc/tcvtlab1.tab for "17D", we see it comes from the table for the MS 1252 encoding:

---snip---
static sal_uInt16 const aImplMS1252ToUniTab[MS1252UNI_END - MS1252UNI_START + 1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x20AC,      0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 0x80 */
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152,      0, 0x017D,      0, /* 0x80 */
---snip---

and indeed, if we look at the constructor for SvRTFParser, we see that's the encoding it initially sets:

---snip---
SvRTFParser::SvRTFParser( SvStream& rIn, sal_uInt8 nStackSize )
    : SvParser( rIn, nStackSize ),
    eUNICodeSet( RTL_TEXTENCODING_MS_1252 ),    // default ist ANSI-CodeSet
    nUCharOverread( 1 )
{
    // default ist ANSI-CodeSet
    SetSrcEncoding( RTL_TEXTENCODING_MS_1252 );
    bRTF_InTextRead = false;
}
---snip---

But SvRTFParser::Continue() must be getting called after the constructor, and it seems to set the "mac" encoding, so why is the wrong encoding still used?
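To make the symptom concrete, the following sketch decodes the same byte under both tables. It is only a toy model and does not attempt to explain the bug: SketchEncoding, SketchParser and decodeHighByte() are hypothetical names, the single activeEncoding field is an assumption that may not match how SvRTFParser actually stores its encodings, and only the 0x80..0x8F slices quoted above are included.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two rtl_TextEncoding values in play.
enum class SketchEncoding { Ms1252, AppleRoman };

// Entries 0x80..0x8F of aImplMS1252ToUniTab, as quoted above.
static const std::uint16_t kMs1252Row80[16] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0
};

// Entries 0x80..0x8F of aImplAPPLEROMANToUniTab, as quoted above.
static const std::uint16_t kAppleRomanRow80[16] = {
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1,
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8
};

// Look up one byte in the 0x80..0x8F range under the given encoding.
std::uint16_t decodeHighByte(std::uint8_t b, SketchEncoding enc)
{
    assert(b >= 0x80 && b <= 0x8F); // only the quoted slices are available
    const std::uint16_t* row =
        (enc == SketchEncoding::AppleRoman) ? kAppleRomanRow80 : kMs1252Row80;
    return row[b - 0x80];
}

// Toy parser state: starts in MS 1252 like the constructor shown above,
// and switches when "\mac" is seen. Assumption: one field holds the
// encoding used for text; the real parser may be structured differently.
struct SketchParser
{
    SketchEncoding activeEncoding = SketchEncoding::Ms1252;

    void onMacKeyword() { activeEncoding = SketchEncoding::AppleRoman; }

    std::uint16_t decode(std::uint8_t b) const
    {
        return decodeHighByte(b, activeEncoding);
    }
};
```

In this model, decoding 0x8E before onMacKeyword() yields 0x017D ("Ž"), and decoding it afterwards yields 0x00E9 ("é"). The observed output is therefore consistent with the text being decoded as if the "\mac" switch had not taken effect.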