[PATCH] D106577: [clang] Define __STDC_ISO_10646__

James Y Knight via Phabricator via cfe-commits Fri, 20 Aug 2021 22:28:18 -0700

jyknight added a comment.

In D106577#2944086 <https://reviews.llvm.org/D106577#2944086>, @aaron.ballman 
wrote:


>> I don't think that scenario is valid. MBCS-to-unicode mappings are a part of 
>> the definition of the MBCS (sometimes officially, sometimes de-facto defined 
>> by major vendors), not in the definition of Unicode.
>
> Isn't that scenario basically the one we're in today where the compiler is 
> unaware of what mappings the library provides?

What I mean is: Unicode does not define the mapping from a legacy MBCS byte 
sequence to a Unicode character. That is simply out of scope. The Unicode 
standard defines only three encodings (UTF-8, UTF-16, UTF-32). Mappings for 
other encodings are defined, instead, either by their own standard, or else 
simply chosen arbitrarily by a vendor.

>> And in fact, we have a real-life example of this: the GB18030 encoding. That 
>> standard specifies 24 character mappings to private-use-area unicode 
>> codepoints in the most recent version, GB18030-2005. (Which is down from 80 
>> PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, 
>> a new version of Unicode coming out will not affect that. Rather, I should 
>> say, DID NOT affect that -- all of those 24 characters mapped to PUAs in 
>> GB18030-2005 were actually assigned official unicode codepoints by 2005 
>> (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still 
>> maps those to PUA code-points. The only way that can change is if GB18030 
>> gets updated.
>>
>> I do note that some implementations (e.g. glibc) have taken it upon 
>> themselves to modify the official GB18030 character mapping table, and to 
>> decode those 24 codepoints to the newly-defined unicode characters, instead 
>> of the specified PUA codepoints. But there's no way that can be described as 
>> a requirement -- it's not even technically correct!
>
> Does that imply that an implementation supporting that encoding can't define 
> __STDC_ISO_10646__ because it doesn't meet the "has the same value as the 
> short identifier" requirement?

No. The fact that the GB18030 encoding has an unfortunate mapping of its bytes 
to Unicode characters does not change anything about `__STDC_ISO_10646__`. It 
does not affect "every character in the Unicode required set, when stored in 
an object of type wchar_t, has the same value as the short identifier of that 
character" at all. All we're talking about here is a difference of opinion 
between implementations as to which Unicode character a given GB18030 byte 
sequence should be translated to -- not the way in which a Unicode character 
is stored in a wchar_t.
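
To make the distinction concrete, here is a minimal C sketch. The helper names 
are hypothetical, and the two code points are placeholders rather than real 
GB18030 table entries. It only illustrates that two decoders may disagree on 
which code point a byte sequence maps to, while each result is still stored in 
a wchar_t as its own code point value, which is all the macro speaks to.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: true when the implementation advertises that
 * wchar_t holds ISO 10646 (Unicode) code point values. */
static bool wchar_is_iso_10646(void) {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

/* Two hypothetical decoders for the SAME legacy byte sequence.
 * The code points are placeholders, not real GB18030 mappings. */

/* Follows the encoding's own table: yields a private-use code point. */
static wchar_t decode_per_standard_table(void) {
    return (wchar_t)0xE000;
}

/* Follows a vendor-modified table: yields an assigned code point. */
static wchar_t decode_per_vendor_table(void) {
    return (wchar_t)0x4E00;
}

/* The only property __STDC_ISO_10646__ is about: the value stored in
 * the wchar_t equals the code point (the "short identifier"). */
static bool holds_short_identifier(wchar_t wc, unsigned long code_point) {
    return (unsigned long)wc == code_point;
}
```

The decoders return different characters, yet `holds_short_identifier` is true 
for each against its own code point. Which table a library picks is a property 
of the codec, not of how wchar_t stores Unicode characters.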

> @jyknight, are you on the WG14 reflectors btw? Would you like to carry on 
> with this discussion over there (or would you like me to convey your 
> viewpoints on your behalf)?

I'm not. I'd be happy to have you convey my viewpoints.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
