[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Hubert Tong via Phabricator via cfe-commits Sat, 12 Aug 2023 16:16:51 -0700

hubert.reinterpretcast added inline comments.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16899
+          uint32_t CodePoint = 
static_cast<uint32_t>(V.getInt().getZExtValue());
+          PrintCharLiteralPrefix(BTy->getKind(), OS);
+          OS << '\'';
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > hubert.reinterpretcast wrote:
> > > cor3ntin wrote:
> > > > Looking at the diagnostics, I don't think it makes sense to print a 
> > > > prefix here. You could just leave that part out.
> > > Why is removing the prefix better? The types can matter (characters 
> > > outside the basic character set are allowed to have negative `char` 
> > > values). Also, moving forward, the value of a character need not be the 
> > > same in the various encodings.
> > Some fun with signedness (imagine a more realistic example with 
> > `ISO-8859-1` ordinary character encoding with a signed `char` type):
> > ```
> > $ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] 
> > == U\'\\uFF10\');'
> > <stdin>:1:15: error: static assertion failed due to requirement 
> > 'L"\xFF10"[0] == U'\uff10''
> >     1 | static_assert(L"\uFF10"[0] == U'\uFF10');
> >       |               ^~~~~~~~~~~~~~~~~~~~~~~~~
> > <stdin>:1:28: note: expression evaluates to ''０' (0xFF10) == '０' (0xFF10)'
> >     1 | static_assert(L"\uFF10"[0] == U'\uFF10');
> >       |               ~~~~~~~~~~~~~^~~~~~~~~~~~
> > 1 error generated.
> > Return:  0x01:1   Fri Aug 11 23:49:02 2023 EDT
> > ```
> Either we care about the actual character - ie `'a'`, or it's value (ie 
> `42`). The motivation for the current patch is to add the value to the 
> diagnostic message.
> I'm also concerned about mixing things that are are are not lexical elements 
> in the diagnostics
Maybe the //motivation// for the current patch is to add the value, but what it 
does (for wide characters as defined in C) is to add the character (and 
obfuscate the value).

Observe the status quo (https://godbolt.org/z/Wc6nKvTMn):
```
note: expression evaluates to '-240 == 65296'
```

From the output higher up (with this patch), we see two "identical" characters 
and values (due to lack of decimal value output). With decimal value output 
added, it will still be potentially confusing why the two identical characters 
have different values (without some sort of type annotation).

I admit that the confusion arises in the status quo treatment of `signed char` 
and `unsigned char`. I hope I am using the word correctly when I say that it is 
ironic that the patch breaks in one context what it seeks to fix in another.



================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16869
+/// The code point needs to be zero-extended to 32-bits.
+void ConvertCharToString(uint32_t CodePoint, const BuiltinType *BTy,
+                         unsigned TyWidth, llvm::raw_ostream &OS) {
----------------
It does not seem that the first parameter expects a `CodePoint` argument in all 
cases. For `Char_S`, `Char_U`, and `Char8`, it seems the function wants to 
treat the input as a UTF-8 code unit.

I suggest changing the argument to be clearly a code unit (and potentially 
treat it as a code point value as appropriate later in the function).

Also: The function should probably be declared as having static linkage.
Additionally: The function does not "convert" in the language semantic sense. 
`WriteCharacterValueDescriptionForDisplay` might be a better name.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16876
+  // other types.
+  if (CodePoint <= UCHAR_MAX) {
+    StringRef Escaped = escapeCStyle<EscapeChar::Single>(CodePoint);
----------------
For types other than `Char_S`, `Char_U`, and `Char8`, this fails to treat the 
C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks 
like test coverage for these cases are missing.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16930-16936
+        case BuiltinType::Char_S:
+        case BuiltinType::Char_U:
+        case BuiltinType::Char8:
+        case BuiltinType::Char16:
+        case BuiltinType::Char32:
+        case BuiltinType::WChar_S:
+        case BuiltinType::WChar_U: {
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > Add FIXME for `char` and `wchar_t` cases that this assumes Unicode literal 
> > encodings.
> If we wanted such fixme, it should be L1689.
The `ConvertCharToString` has a first parameter called `CodePoint`. With that 
interface[^1], it is sensible to insert conversion from the applicable literal 
encoding to a Unicode code point value here (thus my request for a FIXME here).

You are probably right that the FIXME belongs elsewhere. If you were thinking 
what I am thinking, then I am guessing you meant L16894? That is where the 
`ConvertCharToString` function seems to assume that a `wchar_t` value is 
directly a "code point value". To generate hex escapes, the function needs to 
be passed the original value (including for `char`s, e.g., to handle stray code 
units). Once the interface is updated (i.e., the parameter is renamed), the 
`ConvertCharToString` function would more clearly be the place to put one or 
more FIXMEs about encoding assumptions.

[^1]: It turns out that the parameter is already not treated consistently as a 
code point value within the function (and by the caller) and the parameter is 
just badly named.



================
Comment at: clang/test/SemaCXX/static-assert.cpp:287
+  static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error 
{{failed}} \
+                                                  // expected-note {{evaluates 
to ''ゆ' (0x3086) == '̵' (0x335)'}}
+  static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > The C++23 escaped string formatting facility would not generate a trailing 
> > combining character like this. I recommend following suit.
> > 
> > Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335
> > 
> This is way outside the scope of the patch. The diagnostic output facility 
> has no understanding of combining characters or graphemes and do not attempt 
> to match std::print. It probably would be an improvement but this patch is 
> not trying to modify how all diagnostics are printed. (all of that logic is 
> in Diagnostic.cpp)
This patch is pushing the envelope of what appears in diagnostics. One can also 
argue that someone writing
```
static_assert(false, "\u0301");
```
gets what they deserve, but that case does not have a big problem anyway 
(because the provided message text appears after `: `).

This patch increases the exposure of the diagnostic output facility to input 
that it does not handle well. I disagree that it is outside the scope of this 
patch to insist that it does not generate such inputs to the diagnostic output 
facility (even if a possible solution is to modify the diagnostic output 
facility first).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155610/new/

https://reviews.llvm.org/D155610

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Reply via email to