[Patch]multibyte character decoding issue when importing RTF document

Hung Mark Sat, 23 Aug 2014 21:55:55 -0700

The issue has been submitted to Bugzilla
https://issues.apache.org/ooo/show_bug.cgi?id=125495



When importing an RTF file with Chinese numbering created by MSO, the
numbering suffixes are changed to strange characters (like B, i), as shown
in the attached image file.

After tracing the code in gdb, I saw that the encoding of the parser state
returns to the default partway through, so multibyte strings are treated as
ANSI strings.

Because code page control words like \ansicpg950 appear after the first
bracket '{', the first parsing state is pushed onto the stack before the
correct encoding is set. Later, when it is popped, the encoding of the
later state is affected and falls back to the default even though
\ansicpg950 has already appeared. As a consequence, multibyte string
conversion for text tokens is affected.

The fix is to call setEncoding instead of setSrcEncoding when an
encoding-related control word is seen.

The updated code overwrites the encoding of the state on top of the stack.
Since setEncoding exists but nothing calls it, I wonder whether this was a
typo by the original author.
The patch has been verified to work in my environment.
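
To make the stack behaviour described above easier to follow, here is a
minimal toy model. It is not the real parser code: ToyParser, Encoding and
encodingForText are hypothetical names, and the model assumes that closing
a group restores the encoding from the state that is then on top of the
stack. Only the names setSrcEncoding and setEncoding are borrowed from the
description above, and their bodies here are a simplified guess.

```cpp
#include <cassert>
#include <stack>

// Hypothetical stand-ins for illustration; not the real parser classes.
enum class Encoding { Default, Cp950 };

class ToyParser {
public:
    // '{': save the encoding currently in effect on the state stack.
    void openGroup() { states_.push(current_); }

    // '}': drop the group's state, then restore the encoding from the
    // state that is now on top (assumed behaviour of the real parser).
    void closeGroup() {
        if (states_.empty()) return;
        states_.pop();
        current_ = states_.empty() ? Encoding::Default : states_.top();
    }

    // Old call: only the encoding in effect changes; the saved state
    // on top of the stack keeps its old value.
    void setSrcEncoding(Encoding e) { current_ = e; }

    // Patched call: also overwrite the state on top of the stack.
    void setEncoding(Encoding e) {
        if (!states_.empty()) states_.top() = e;
        current_ = e;
    }

    Encoding current() const { return current_; }

private:
    std::stack<Encoding> states_;
    Encoding current_ = Encoding::Default;
};

// Simulates "{ <code page control word> { ... } text" and returns the
// encoding that would be used for the trailing text.
Encoding encodingForText(bool patched) {
    ToyParser p;
    p.openGroup();                      // first '{': Default is saved
    if (patched) p.setEncoding(Encoding::Cp950);
    else         p.setSrcEncoding(Encoding::Cp950);
    p.openGroup();                      // nested group, e.g. a font table
    p.closeGroup();                     // '}' restores from the outer state
    return p.current();
}

// Checks both paths of the toy model.
void checkModel() {
    assert(encodingForText(false) == Encoding::Default);  // encoding lost
    assert(encodingForText(true) == Encoding::Cp950);     // encoding kept
}
```

In this model the unpatched path ends up back at the default encoding
after the nested group closes, while the patched path keeps the code page,
which matches what I observed in gdb.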

In theory, all documents in a multibyte character encoding are affected.

Please help to review & merge if possible.


-- 
Mark Hung
