[PATCH] D123064: [Clang][C++23] P2071 Named universal character escapes

Aaron Ballman via Phabricator via cfe-commits Mon, 11 Apr 2022 09:08:42 -0700

aaron.ballman added reviewers: aaron.ballman, tahonermann, clang-language-wg.
aaron.ballman added subscribers: tstellar, lattner.
aaron.ballman added a comment.


Thank you for this functionality! I did a pretty quick pass over it and have 
some comments.



================
Comment at: clang/include/clang/Basic/DiagnosticLexKinds.td:130-131
 
 def ext_delimited_escape_sequence : Extension<
-  "delimited escape sequences are a Clang extension">,
+  "%select{delimited|named}0 escape sequences are a Clang extension">,
   InGroup<DiagGroup<"delimited-escape-sequence-extension">>;
----------------
cor3ntin wrote:
> tahonermann wrote:
> > I don't see much value in combining these diagnostics since these are 
> > distinct features. The `ext_delimited_escape_sequence` name seems odd for 
> > named escape sequences too (even if both features use `{` and `}` as 
> > delimiters).
> I'm used to @aaron.ballman encouraging me to keep the number of diagnostics 
> under control, but I'm fine keeping them separate 
While Tom's right that they're distinct features, diagnostics aren't tied 
directly to a feature in Clang; we try to reuse existing diagnostic text as 
much as possible. My preference is to keep them combined so we reduce the 
number of strings we build into the executable (and we make life easier for 
eventual localization of the diagnostics).


================
Comment at: clang/lib/Lex/Lexer.cpp:3139
       Diag(SlashLoc, diag::warn_ucn_not_valid_in_c89);
-    return 0;
+    return {};
   }
----------------
FWIW, I think using `llvm::None` instead of `{}` is more clear to readers (I 
doubt we're consistent with this in the code base though).

(Same comment applies other places that we're making an empty `Optional`.)


================
Comment at: clang/lib/Lex/Lexer.cpp:3226
+llvm::Optional<uint32_t> Lexer::tryReadNamedUCN(const char *&StartPtr,
+                                                const char *, Token *Result) {
+  unsigned CharSize;
----------------



================
Comment at: clang/lib/Lex/Lexer.cpp:3237-3239
+  if (C != '{') {
+    if (Diagnose)
+      Diag(StartPtr, diag::warn_ucn_escape_incomplete);
----------------
Is this a case where we might want a fixit because the user did `\N` when they 
meant to do `\n`?

(Should we look for `\N(` and fixit that to `\N{`?)


================
Comment at: clang/lib/Lex/Lexer.cpp:3290
+    // and we should not make invalid suggestions.
+  }
+
----------------
I think it'd be more clear to set `Res = LooseMatch->CodePoint;` before exiting 
here so that after this block, you can always assume there's a valid code point 
(this helps if we later want to add more logic after this point, too).


================
Comment at: clang/lib/Lex/Lexer.cpp:3292-3294
+  if (Diagnose && PP && !LooseMatch) {
+    Diag(BufferPtr, diag::ext_delimited_escape_sequence) << /*named*/ 1;
+  }
----------------



================
Comment at: clang/lib/Lex/Lexer.cpp:3305
+  }
+  return Res ? *Res : LooseMatch->CodePoint;
+}
----------------
Then this can become `return *Res;`


================
Comment at: clang/lib/Lex/Lexer.cpp:3315-3317
+    CodePointOpt = tryReadNumericUCN(StartPtr, SlashLoc, Result);
+
+  else if (Kind == 'N')
----------------



================
Comment at: clang/lib/Lex/Lexer.cpp:3325-3327
+  if (Result) {
+    Result->setFlag(Token::HasUCN);
+  }
----------------
I think this should live inside the other `tryRead*` functions once we know the 
token is definitely a UCN, in case someone finds a reason to want to call one 
of those concretely but not call `tryReadUCN()` to do so.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+             diag::err_delimited_escape_missing_brace)
+            << std::string(1, 'o');
 
----------------
Can you use `"o"` instead of constructing a `std::string` for it?


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:362
+      assert(Delim != Input.end());
+      auto Res = llvm::sys::unicode::nameToCodepointLooseMatching(
+          StringRef(I, std::distance(I, Delim)));
----------------
Please spell out the type for `Res` as it's not obvious from context what it is.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:400-406
   const char *UcnBegin = ThisTokBuf;
-
   // Skip the '\u' char's.
   ThisTokBuf += 2;
-
-  bool Delimited = false;
-  bool EndDelimiterFound = false;
   bool HasError = false;
 
+  Delimited = false;
+  bool EndDelimiterFound = false;
----------------
Might as well clean this up a bit more


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:488
+
+  auto Res = llvm::sys::unicode::nameToCodepointLooseMatching(Name);
+  if (Res) {
----------------
Please spell the type out.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:500
+  unsigned Distance = 0;
+  auto Matches = llvm::sys::unicode::nearestMatchesForCodepointName(Name, 5);
+  assert(!Matches.empty() && "No unicode characters found");
----------------
Same here.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:501
+  auto Matches = llvm::sys::unicode::nearestMatchesForCodepointName(Name, 5);
+  assert(!Matches.empty() && "No unicode characters found");
+
----------------
Just double checking that the assertion here is valid and the function can 
never return an empty set.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:510
+
+    std::string s;
+    llvm::UTF32 V = Match.Value;
----------------



================
Comment at: clang/lib/Lex/LiteralSupport.cpp:564
+  ThisTokBuf = ClosingBrace + 1;
+  auto Res = llvm::sys::unicode::nameToCodepointStrict(Name);
+  if (!Res) {
----------------
Spell out the type.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:593-597
+  } else
+    HasError =
+        !ProcessNumericUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, UcnVal,
+                                 UcnLen, IsDelimitedEscapeSequence, Loc, Diags,
+                                 Features, in_char_string_literal);
----------------
May as well be consistent with the first branch, esp given that this is a 
multiline substatement.


================
Comment at: clang/test/Lexer/char-escapes-delimited.c:85
+  char d = '\N{}';          // expected-error {{delimited escape sequence 
cannot be empty}}
+  char e = '\N{';           // expected-error {{incomplete universal character 
name}}
+
----------------
Here's another fun test:
```
char trigraphs = '\N??<DOLLAR SIGN??>`;
```
and we should test that this magic works in the preprocessor:
```
if `\N{LATIN SMALL LETTER A}` != 'a'
#error "oh no!"
#endif
```

Might be worthwhile to have a preprocessor test to show that you can't form a 
named UCN through token concatenation.


================
Comment at: clang/test/Lexer/char-escapes-delimited.c:94
+  unsigned i = u'\N{GREEK CAPITAL LETTER DELTA}'; // expected-warning 
{{extension}}
+  char j = '\NN';                                 // expected-error {{expected 
'{' after '\N' escape sequence}}
+  unsigned k = u'\N{LOTUS';                       // expected-error 
{{incomplete universal character name}}
----------------
This has the potential to break user code that was doing `\N`: 
https://godbolt.org/z/rahbsssPo However, it's consistent with how we handle 
`\U`, `\x`, and `\u`, so it's defensible.


================
Comment at: llvm/include/llvm/ADT/StringExtras.h:329
 
+std::string to_hexString(uint64_t Value, bool UpperCase = true);
+
----------------
`utohexstr()` already exists on line 152 -- any reason we can't reuse that?


================
Comment at: llvm/include/llvm/Support/Unicode.h:19
+#include "llvm/ADT/SmallString.h"
+#include <vector>
+
----------------
No need to include vector?


================
Comment at: llvm/include/llvm/Support/Unicode.h:77-94
+llvm::Optional<char32_t> nameToCodepointStrict(StringRef Name);
+
+struct LooseMatchingResult {
+  char32_t CodePoint;
+  llvm::SmallString<64> Name;
+};
+
----------------
Already in the `llvm` namespace so no need for fully qualified names.


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:31-44
+struct node {
+  bool is_root = false;
+  char32_t value = 0xFFFFFFFF;
+  uint32_t children_offset = 0;
+  bool has_sibling = false;
+  uint32_t size = 0;
+  StringRef name = {};
----------------
You should rename things to match the usual coding conventions.


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:37
+  uint32_t size = 0;
+  StringRef name = {};
+  const node *parent = nullptr;
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:49
+    std::string s;
+    s.reserve(64);
+    auto n = this;
----------------
Any particular reason for 64?


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:50
+    s.reserve(64);
+    auto n = this;
+    while (n) {
----------------
(not like `auto` vs `node` saves you much typing.)


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:72-80
+  const uint32_t origin = offset;
+  node n;
+  n.parent = parent;
+  uint8_t nameInfo = UnicodeNameToCodepointIndex[offset++];
+  if (offset + 6 >= UnicodeNameToCodepointIndexSize)
+    return n;
+
----------------
(We don't typically use top-level const qualification.)


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:187-189
+  if (!DoesStartWith) {
+    return {n, false, 0};
+  }
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:191-193
+  if (name.size() - Consummed == 0 && n.value != 0xFFFFFFFF) {
+    return {n, true, n.value};
+  }
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:195
+  if (n.has_children()) {
+    auto o = n.children_offset;
+    for (;;) {
----------------
Please spell out the type (I'll stop commenting on the style nits -- you should 
do a pass for all that stuff at some point).


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:257
+constexpr const char32_t SBase = 0xAC00;
+// constexpr const char32_t LBase = 0x1100;
+// constexpr const char32_t VBase = 0x1161;
----------------
Plan to remove the commented out code?


================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:364
+    unsigned long long v = 0;
+    // Be consistent about mandating uper casing
+    if (Strict &&
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:379-384
+static llvm::Optional<char32_t> nameToCodepoint(StringRef name, bool Strict,
+                                                BufferType &Buffer) {
+
+  if (name.empty())
+    return {};
+  llvm::Optional<char32_t> Res = nameToHangulCodePoint(name, Strict, Buffer);
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:409-411
+llvm::Optional<char32_t> nameToCodepointStrict(StringRef name) {
+
+  BufferType Buffer;
----------------



================
Comment at: llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp:4
+// This file was generated using ./bin/UnicodeNameMappingGenerator. Do not edit
+// manually.//
+//===----------------------------------------------------------------------===//
----------------
Looks like the generator needs to add some whitespace or drop the trailing //.


================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:17-18
+// List of generated names
+// Should be kept in sync with Unicode
+// "Name Derivation Rule Prefix String".
+static bool generated(char32_t c) {
----------------
Do we have something more direct to point users towards?


================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:339-342
+  printf("NameAliases.txt can be found at "
+         "https://unicode.org/Public/14.0.0/ucd/NameAliases.txt\n";
+         "UnicodeData.txt can be found at "
+         "https://unicode.org/Public/14.0.0/ucd/UnicodeData.txt\n\n";);
----------------
This raises a bit of a question about licensing -- we're making a derivative 
work from Unicode, so I think https://www.unicode.org/license.txt kicks in and 
we need to update our own license.txt files accordingly.

@tstellar @lattner -- can one of you weigh in on this? (tl;dr: we use this new 
tool to download data from Unicode.org and use it to generate 
llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp which is committed into 
the source tree. The feature being implemented here is a C++23 feature and we 
cannot implement it without this data.)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123064/new/

https://reviews.llvm.org/D123064

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D123064: [Clang][C++23] P2071 Named universal character escapes

Reply via email to