[PATCH] D106577: [clang] Define __STDC_ISO_10646__

James Y Knight via Phabricator via cfe-commits Fri, 13 Aug 2021 08:09:10 -0700

jyknight added a comment.

In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:


>> One specific example I'd like to be considered:
>> Suppose the C standard library implementation's mbstowcs converts a certain 
>> multi-byte character C to somewhere in the Unicode private use area, because 
>> Unicode version N doesn't have a corresponding character. Suppose further 
>> that the compiler is aware of Unicode version N+1, in which a character 
>> corresponding to C was added. Is an implementation formed by that 
>> combination of compiler and standard library, that defines 
>> `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming because it 
>> represents character C as something other than the corresponding short name 
>> from Unicode version N+1?
>
> And David Keaton (long-time WG14 member and current convener) replied:
>
>> Yikes!  It does indeed sound like the library would affect the value of 
>> `__STDC_ISO_10646__` in that case.  Thanks for clarifying the details.
>
> There was no further discussion after that point, so I think the unofficial 
> WG14 stance is that the compiler and the library need to collude on setting 
> the value of that macro.

I don't think that scenario is valid. MBCS-to-Unicode mappings are part of 
the definition of the MBCS (sometimes officially, sometimes defined de facto by 
major vendors), not part of the definition of Unicode.

And in fact, we have a real-life example of this: the GB18030 encoding. That 
standard specifies 24 character mappings to private-use-area (PUA) Unicode 
code points in the most recent version, GB18030-2005. (Which is down from 80 PUA 
mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new 
version of Unicode coming out will not affect that. Rather, I should say, DID 
NOT affect that -- all of those 24 characters mapped to PUA code points in 
GB18030-2005 had actually been assigned official Unicode code points by 2005 
(some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps 
those to PUA code points. The only way that can change is if GB18030 gets updated.
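
To make the PUA point concrete, here is a minimal sketch (mine, not part of 
the patch) of how one might test whether a decoded code point falls in a 
private-use range. The name `is_private_use` is hypothetical; the three ranges 
are the private-use ranges that Unicode itself defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if cp lies in one of the three
 * private-use ranges defined by Unicode. */
bool is_private_use(uint_least32_t cp) {
    return (cp >= 0xE000u   && cp <= 0xF8FFu)     /* BMP Private Use Area        */
        || (cp >= 0xF0000u  && cp <= 0xFFFFDu)    /* Supplementary PUA-A (pl. 15) */
        || (cp >= 0x100000u && cp <= 0x10FFFDu);  /* Supplementary PUA-B (pl. 16) */
}
```

Assuming wchar_t values are Unicode code points, a caller could apply this to 
each wchar_t that mbstowcs produces under a GB18030 locale. Whether any result 
lands in the PUA depends on the library's mapping table, not on which Unicode 
version the compiler knows about.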

I do note that some implementations (e.g. glibc) have taken it upon themselves 
to modify the official GB18030 character mapping table, and to decode those 24 
characters to the newly-defined Unicode code points, instead of the specified 
PUA code points. But there's no way that can be described as a requirement -- 
it's not even technically correct!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577
