jyknight added a comment.

In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:

>> One specific example I'd like to be considered:
>> Suppose the C standard library implementation's mbstowcs converts a certain
>> multi-byte character C to somewhere in the Unicode private use area, because
>> Unicode version N doesn't have a corresponding character. Suppose further
>> that the compiler is aware of Unicode version N+1, in which a character
>> corresponding to C was added. Is an implementation formed by that
>> combination of compiler and standard library, that defines
>> `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming because it
>> represents character C as something other than the corresponding short name
>> from Unicode version N+1?
>
> And David Keaton (long-time WG14 member and current convener) replied:
>
>> Yikes! It does indeed sound like the library would affect the value of
>> `__STDC_ISO_10646__` in that case. Thanks for clarifying the details.
>
> There was no further discussion after that point, so I think the unofficial
> WG14 stance is that the compiler and the library need to collude on setting
> the value of that macro.

I don't think that scenario is valid. MBCS-to-Unicode mappings are part of the definition of the MBCS (sometimes officially, sometimes de facto, as defined by major vendors), not part of the definition of Unicode.

And in fact, we have a real-life example of this: the GB18030 encoding. The most recent version of that standard, GB18030-2005, specifies 24 character mappings to private-use-area (PUA) Unicode code points. (That is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet a new version of Unicode coming out will not affect that.

Rather, I should say, DID NOT affect that -- all 24 of the characters mapped to the PUA in GB18030-2005 had actually been assigned official Unicode code points by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps them to PUA code points. The only way that can change is if GB18030 itself gets updated.

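To see what a particular decoder actually does with such byte sequences, a small check along these lines may help. This is only a sketch: `private_use_chars` is a hypothetical helper name, not part of any library, and what it reports for GB18030 input depends entirely on the mapping table shipped with the codec in use, so no GB18030 output is claimed here.

```python
import unicodedata


def private_use_chars(data, encoding="gb18030"):
    """Return the private-use code points produced by decoding `data`.

    Hypothetical helper for illustration; the result reflects whatever
    mapping table the chosen codec implements.
    """
    text = data.decode(encoding)
    # General category "Co" marks private-use code points
    # (the BMP private use area and planes 15 and 16).
    return ["U+%04X" % ord(ch) for ch in text if unicodedata.category(ch) == "Co"]


# Self-check with UTF-8, where the outcome does not depend on a mapping table.
print(private_use_chars("\ue000A".encode("utf-8"), "utf-8"))  # ['U+E000']
```

Feeding the same helper the GB18030 byte sequences in question shows whether a given implementation follows the published table (PUA results) or substitutes the later Unicode assignments (empty result).
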
I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 code points to the newly defined Unicode characters instead of the specified PUA code points. But there's no way that can be described as a requirement -- it's not even technically correct!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits