[PATCH] D123064: [Clang][C++23][WIP] P2071 Named universal character escapes

Corentin Jabot via Phabricator via cfe-commits Wed, 06 Apr 2022 12:39:55 -0700

cor3ntin added a subscriber: aaron.ballman.
cor3ntin added inline comments.



================
Comment at: clang/include/clang/Basic/DiagnosticLexKinds.td:130-131
 
 def ext_delimited_escape_sequence : Extension<
-  "delimited escape sequences are a Clang extension">,
+  "%select{delimited|named}0 escape sequences are a Clang extension">,
   InGroup<DiagGroup<"delimited-escape-sequence-extension">>;
----------------
tahonermann wrote:
> I don't see much value in combining these diagnostics since these are 
> distinct features. The `ext_delimited_escape_sequence` name seems odd for 
> named escape sequences too (even if both features use `{` and `}` as 
> delimiters).
I'm used to @aaron.ballman encouraging me to keep the number of diagnostics 
under control, but I'm fine keeping them separate 


================
Comment at: clang/lib/Lex/Lexer.cpp:3255-3260
+    if (!isAlphanumeric(C) && C != '_' && C != '-' && C != ' ')
+      break;
+
+    if ((C < 'A' || C > 'Z') && !llvm::isDigit(C) && C != ' ' && C != '-') {
+      Invalid = true;
+    }
----------------
tahonermann wrote:
> It isn't clear to me why there are two separate `if` statements here. I would 
> expect the following to suffice. If I'm misunderstanding something, then 
> comments might be helpful.
>   if (!isUppercase(C) && !isDigit(C) && C != '-' && C != ' ') {
>     Invalid = true;
>     break;
>   }
> 
> I'm not sure why there is a test for '_' since that character is not present 
> in the grammar for `n-char` in P2071.
> 
> Is it intentional that there is no `break` statement in the second `if` 
> statement?
I improved that, what it does should be more clear now. More importantly, I 
added a diagnostic note when we detect a loose match.

We allow `_` because  Unicode does.
We first perform a strict match - which fails as no Unicode name contains an 
underscore, we emit a diagnostic, and then we try a loose matching which does 
allow `_`.
This enable us to produces better diagnostics

```
<source>:2:18: error: 'GREEK_SMALL_LETTER-OMICRON' is not a valid Unicode 
character name
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break 
space}"; 
                 ^
<source>:2:20: note: characters names in Unicode escape sequences are sensitive 
to case and whitespaces
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break 
space}"; 
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
                   GREEK SMALL LETTER OMICRON
<source>:2:54: error: 'zero width no break space' is not a valid Unicode 
character name
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break 
space}"; 
                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
<source>:2:54: note: characters names in Unicode escape sequences are sensitive 
to case and whitespaces
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break 
space}"; 
                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
                                                     ZERO WIDTH NO-BREAK 
SPACE```



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123064/new/

https://reviews.llvm.org/D123064

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D123064: [Clang][C++23][WIP] P2071 Named universal character escapes

Reply via email to