[Issue 103308] HTML import mangles non-BMP unicodes

bugzilla Tue, 03 Jan 2023 01:37:16 -0800

https://bz.apache.org/ooo/show_bug.cgi?id=103308


dam...@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Latest|---                         |4.2.0-dev
    Confirmation in|                            |
                 CC|                            |dam...@apache.org

--- Comment #5 from dam...@apache.org ---
(In reply to h...@apache.org from comment #2)
> Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in
> sw/source/filter/html/parcss1.cxx is 
> probably a good starting point.

Yes, but that's just CSS parsing. The remainder of the HTML parsing is in
main/svtools/source/svhtml/parhtml.cxx, which, sadly like most of our codebase,
also operates one UTF-16 code unit at a time, retrieved from
SvParser::GetNextChar().

The function
inline sal_uInt16 GetCharSize() const;
got my hopes up: does it tell us the code point size?

inline sal_uInt16 SvParser::GetCharSize() const
{
    return (RTL_TEXTENCODING_UCS2 == eSrcEnc) ? 2 : 1;
}

No, it only returns 2 when the source encoding is UCS-2 and 1 otherwise, which
says nothing about code points. A useless statistic.

SvParser does not have any functions for code points. We'd have to add them and
change a lot of code - not just HTML parsing - to use them.
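
As a rough illustration of what such a code point function might look like,
here is a minimal standalone sketch. It is not existing SvParser code: the
names isHighSurrogate, isLowSurrogate and nextCodePoint are hypothetical, and
it uses char16_t and std::u16string as stand-ins for sal_Unicode and our own
string classes. It assumes the input is already UTF-16 and that pos is a valid
index into the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of SvParser.
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Reads one code point starting at s[pos] and advances pos past it.
// Precondition: pos < s.size().
// A high surrogate followed by a low surrogate is combined into a single
// non-BMP code point; anything else (BMP character or unpaired surrogate)
// is returned unchanged as one code unit.
char32_t nextCodePoint(const std::u16string& s, std::size_t& pos)
{
    const char16_t first = s[pos++];
    if (isHighSurrogate(first) && pos < s.size() && isLowSurrogate(s[pos]))
    {
        const char16_t second = s[pos++];
        return 0x10000
             + ((static_cast<char32_t>(first) - 0xD800) << 10)
             + (static_cast<char32_t>(second) - 0xDC00);
    }
    return first;
}
```

With this, the code units 0xD83D 0xDE00 would come back as the single code
point U+1F600 rather than as two separate "characters". The decoding itself is
the easy part; the real work is making every caller stop assuming that one
code unit is one character.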

-- 
You are receiving this mail because:
You are on the CC list for the issue.
You are the assignee for the issue.
