jyknight added a comment.

In D106577#2944086 <https://reviews.llvm.org/D106577#2944086>, @aaron.ballman wrote:
>> I don't think that scenario is valid. MBCS-to-unicode mappings are a part of
>> the definition of the MBCS (sometimes officially, sometimes de-facto defined
>> by major vendors), not in the definition of Unicode.
>
> Isn't that scenario basically the one we're in today where the compiler is
> unaware of what mappings the library provides?

What I mean is: unicode does not define the mappings of a legacy MBCS byte
sequence to a unicode character. It's simply out of scope. Only 3 encodings
are defined by the Unicode standard (UTF-8, UTF-16, UTF-32). Mappings for
other encodings are defined, instead, either by their own standard, or else
simply chosen arbitrarily by a vendor.

>> And in fact, we have a real-life example of this: the GB18030 encoding. That
>> standard specifies 24 characters mappings to private-use-area unicode
>> codepoints in the most recent version, GB18030-2005. (Which is down from 80
>> PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet,
>> a new version of Unicode coming out will not affect that. Rather, I should
>> say, DID NOT affect that -- all of those 24 characters mapped to PUAs in
>> GB18030-2005 were actually assigned official unicode codepoints by 2005
>> (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still
>> maps those to PUA code-points. The only way that can change is if GB18030
>> gets updated.
>>
>> I do note that some implementations (e.g. glibc) have taken it upon
>> themselves to modify the official GB18030 character mapping table, and to
>> decode those 24 codepoints to the newly-defined unicode characters, instead
>> of the specified PUA codepoints. But there's no way that can be described as
>> a requirement -- it's not even technically correct!
>
> Does that imply that an implementation supporting that encoding can't define
> __STDC_ISO_10646__ because it doesn't meet the "has the same value as the
> short identifier" requirement?

No.
The fact that the GB18030 encoding has an unfortunate mapping of its bytes to
unicode characters does not change anything about `__STDC_ISO_10646__`. It does
not affect "every character in the Unicode required set, when stored in an
object of type wchar_t, has the same value as the short identifier of that
character" at all. All we're talking about here is differences of opinion
between implementations as to which unicode character a given GB18030 byte
sequence should be translated to -- not the way in which a unicode character
is stored in a wchar_t.

> @jyknight, are you on the WG14 reflectors btw? Would you like to carry on
> with this discussion over there (or would you like me to convey your
> viewpoints on your behalf)?

I'm not. I'd be happy to have you convey my viewpoints.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits