Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

Jason Merrill via Gcc-patches Thu, 01 Sep 2022 12:00:47 -0700

On 9/1/22 07:14, Jakub Jelinek wrote:

On Wed, Aug 31, 2022 at 12:14:22PM -0400, Jason Merrill wrote:

On 8/31/22 11:07, Jakub Jelinek wrote:

On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote:

It could be more explicit, but I think we can assume that from the existing
wording; it says it designates the named character.  If there is no such
character, that cannot be satisfied, so it must be ill-formed.

Ok.

So, we could reject the int h case above and accept silently the others?


Why not warn on the others?


We were always silent for the cases like \u123X or \U12345X.
Do you think we should emit some warnings (but never pedwarns/errors in that
case) saying that the text looks like a universal character name but is not a
complete one?


I think that would be helpful, at least for \u{ and \N{.

Ok.

Given what you said above, I think that is what we want for the last 2
for C++23, the question is if it is ok also for C++20/C17 etc. and whether
it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
or not in that case.  We could handle those 2 also differently, just
warn instead of error for the \N{ABC} case if not in C++23 mode when
identifier_pos.


That sounds right.

Here is an incremental version of the patch which will make valid
\u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU
modes before C++23 and split it as separate tokens in ISO modes.


Looks good.
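
The mode rule agreed on above can be sketched as a single predicate. This is only an illustration: `struct ucn_mode`, `handle_braced_escape` and `check_mode_rule` are hypothetical names, and the three booleans are stand-ins for `identifier_pos`, `CPP_OPTION (pfile, delimited_escape_seqs)` and `CPP_OPTION (pfile, std)` as they are used in the patch. The field comments reflect my reading of the thread, not libcpp documentation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the libcpp state the patch consults.  */
struct ucn_mode
{
  bool identifier_pos;        /* Escape seen where an identifier may appear.  */
  bool delimited_escape_seqs; /* Mode has \u{} and \N{} natively (assumed C++23).  */
  bool std;                   /* Strict ISO mode, as opposed to a GNU mode.  */
};

/* Return true if \u{ or \N{ should be parsed as an escape sequence,
   false if it should be left alone and lexed as separate tokens.  */
bool
handle_braced_escape (const struct ucn_mode *m)
{
  return !m->identifier_pos || m->delimited_escape_seqs || !m->std;
}

/* A few sample modes; the -std= spellings in the comments are assumptions.  */
void
check_mode_rule (void)
{
  struct ucn_mode iso_cxx20 = { true, false, true };  /* -std=c++20 */
  struct ucn_mode gnu_cxx20 = { true, false, false }; /* -std=gnu++20 */
  struct ucn_mode iso_cxx23 = { true, true, true };   /* -std=c++23 */
  struct ucn_mode literal = { false, false, true };   /* inside a literal */

  assert (!handle_braced_escape (&iso_cxx20)); /* separate tokens */
  assert (handle_braced_escape (&gnu_cxx20));  /* accepted as an extension */
  assert (handle_braced_escape (&iso_cxx23));  /* part of the language */
  assert (handle_braced_escape (&literal));    /* never split in literals */
}
```

Only the combination of identifier context, a pre-C++23 mode, and strict ISO conformance yields the separate-tokens treatment; every other combination goes through the escape parser.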


Here is a patch which implements that.
I just wonder if we shouldn't have some warning option that would cover
these warnings, currently one needs to use -w to disable those warnings.

Apparently clang uses -Wunicode option to cover these, but unfortunately
they don't bother to document it (nor almost any other warning option),
so it is unclear what else exactly it covers.  Plus a question is how
we should document that option for GCC...

We might as well use the same flag name, and document it to mean what it currently means for GCC.

2022-09-01  Jakub Jelinek  <ja...@redhat.com>

        * charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't
        handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}.
        In possible identifier contexts, don't emit an error and punt
        if \N isn't followed by {, or if \N{} surrounds some lower case
        letters or _.  In possible identifier contexts when not C++23, don't
        emit an error but warning about unknown character names and treat as
        separate tokens.  When treating as separate tokens \u{ or \N{, emit
        warnings.

        * c-c++-common/cpp/delimited-escape-seq-4.c: New test.
        * c-c++-common/cpp/delimited-escape-seq-5.c: New test.
        * c-c++-common/cpp/named-universal-char-escape-5.c: New test.
        * c-c++-common/cpp/named-universal-char-escape-6.c: New test.
        * g++.dg/cpp23/named-universal-char-escape1.C: New test.
        * g++.dg/cpp23/named-universal-char-escape2.C: New test.
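
The identifier-context cases listed in the ChangeLog can be sketched as a small classifier. This is an illustration, not the libcpp code: `classify_named_escape`, the enum names, and the exact character classes (uppercase letters, digits, space and hyphen as strict; lowercase letters and underscore as loose) are assumptions inferred from the ChangeLog wording and the test inputs in this patch.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical shapes of the text that follows "\N".  */
enum name_shape
{
  NO_BRACE,     /* \N not followed by '{'.  */
  EMPTY_NAME,   /* \N{}  */
  UNTERMINATED, /* No '}' after the run of name characters.  */
  LOOSE_NAME,   /* Closed, but contains lowercase letters or '_'.  */
  STRICT_NAME   /* Closed, only uppercase, digits, ' ' and '-'.  */
};

/* S points just past the 'N' of a "\N" sequence.  */
enum name_shape
classify_named_escape (const char *s)
{
  if (*s != '{')
    return NO_BRACE;
  s++;
  const char *name = s;
  int strict = 1;
  for (; *s; s++)
    {
      unsigned char c = (unsigned char) *s;
      if (c == ' ' || c == '-' || isupper (c) || isdigit (c))
        continue;
      if (islower (c) || c == '_')
        strict = 0;
      else
        break; /* Any other character ends the name.  */
    }
  if (*s != '}')
    return UNTERMINATED;
  if (s == name)
    return EMPTY_NAME;
  return strict ? STRICT_NAME : LOOSE_NAME;
}

/* Inputs taken from the new tests, with the leading "a\N" removed.  */
void
check_test_inputs (void)
{
  assert (classify_named_escape ("{});") == EMPTY_NAME);
  assert (classify_named_escape ("{);") == UNTERMINATED);
  assert (classify_named_escape (");") == NO_BRACE);
  assert (classify_named_escape ("ARG);") == NO_BRACE);
  assert (classify_named_escape ("{abc});") == LOOSE_NAME);
  assert (classify_named_escape ("{ABC.123});") == UNTERMINATED);
  assert (classify_named_escape ("{NON-EXISTENT CHAR});") == STRICT_NAME);
}
```

Under this reading, only a `STRICT_NAME` reaches the name lookup, which is where the "is not a valid universal character" diagnostic for `\N{NON-EXISTENT CHAR}` would come from.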

--- libcpp/charset.cc.jj        2022-09-01 09:47:24.146886929 +0200
+++ libcpp/charset.cc   2022-09-01 12:52:28.424034208 +0200
@@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const
    if (str[-1] == 'u')
      {
        length = 4;
-      if (str < limit && *str == '{')
+      if (str < limit
+         && *str == '{'
+         && (!identifier_pos
+             || CPP_OPTION (pfile, delimited_escape_seqs)
+             || !CPP_OPTION (pfile, std)))
        {
          str++;
          /* Magic value to indicate no digits seen.  */
@@ -1462,8 +1466,22 @@ _cpp_valid_ucn (cpp_reader *pfile, const
    else if (str[-1] == 'N')
      {
        length = 4;
+      if (identifier_pos
+         && !CPP_OPTION (pfile, delimited_escape_seqs)
+         && CPP_OPTION (pfile, std))
+       {
+         *cp = 0;
+         return false;
+       }
        if (str == limit || *str != '{')
-       cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+       {
+         if (identifier_pos)
+           {
+             *cp = 0;
+             return false;
+           }
+         cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+       }
        else
        {
          str++;
@@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const

          if (str < limit && *str == '}')
            {
-             if (name == str && identifier_pos)
+             if (identifier_pos && (name == str || !strict))
                {
+                 if (name == str)
+                   cpp_warning (pfile, CPP_W_NONE,
+                                "empty named universal character escape "
+                                "sequence; treating it as separate tokens");
+                 else
+                   cpp_warning (pfile, CPP_W_NONE,
+                                "incomplete named universal character escape "
+                                "sequence; treating it as separate tokens");

It looks like this is handling \N{abc}, for which "incomplete" seems like the wrong description; it's complete, just wrong, and the diagnostic doesn't help correct it.


I think we don't want to handle !strict here, but rather...

                  *cp = 0;
                  return false;
                }
@@ -1515,27 +1541,48 @@ _cpp_valid_ucn (cpp_reader *pfile, const
                                           uname2c_tree, NULL);
                  if (result == (cppchar_t) -1)
                    {
-                     cpp_error (pfile, CPP_DL_ERROR,
-                                "\\N{%.*s} is not a valid universal "
-                                "character", (int) (str - name), name);
+                     bool ret = true;
+                     if (identifier_pos
+                         && !CPP_OPTION (pfile, delimited_escape_seqs))


...here...

+                       ret = cpp_warning (pfile, CPP_W_NONE,
+                                          "\\N{%.*s} is not a valid "
+                                          "universal character; treating it "
+                                          "as separate tokens",
+                                          (int) (str - name), name);
+                     else
+                       cpp_error (pfile, CPP_DL_ERROR,
+                                  "\\N{%.*s} is not a valid universal "
+                                  "character", (int) (str - name), name);

                      /* Try to do a loose name lookup according to
                         Unicode loose matching rule UAX44-LM2.  */
                      char canon_name[uname2c_max_name_len + 1];
                      result = _cpp_uname2c_uax44_lm2 ((const char *) name,
                                                       str - name, canon_name);
-                     if (result != (cppchar_t) -1)
+                     if (result != (cppchar_t) -1 && ret)
                        cpp_error (pfile, CPP_DL_NOTE,
                                   "did you mean \\N{%s}?", canon_name);
                      else
-                       result = 0x40;
+                       result = 0xC0;
+                     if (identifier_pos
+                         && !CPP_OPTION (pfile, delimited_escape_seqs))


...and here.

+                       {
+                         *cp = 0;
+                         return false;
+                       }
                    }
                }
              str++;
              extend_char_range (char_range, loc_reader);
            }
          else if (identifier_pos)
-           length = 1;
+           {
+             cpp_warning (pfile, CPP_W_NONE,
+                          "incomplete named universal character escape "
+                          "sequence; treating it as separate tokens");
+             *cp = 0;
+             return false;
+           }
          else
            {
              cpp_error (pfile, CPP_DL_ERROR,
@@ -1584,12 +1631,17 @@ _cpp_valid_ucn (cpp_reader *pfile, const
        }
      while (--length);

-  if (delimited
-      && str < limit
-      && *str == '}'
-      && (length != 32 || !identifier_pos))
+  if (delimited && str < limit && *str == '}')
      {
-      if (length == 32)
+      if (length == 32 && identifier_pos)
+       {
+         cpp_warning (pfile, CPP_W_NONE,
+                      "empty delimited escape sequence; "
+                      "treating it as separate tokens");
+         *cp = 0;
+         return false;
+       }
+      else if (length == 32)
        cpp_error (pfile, CPP_DL_ERROR,
                   "empty delimited escape sequence");
        else if (!CPP_OPTION (pfile, delimited_escape_seqs)
@@ -1607,6 +1659,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const
       error message in that case.  */
    if (length && identifier_pos)
      {
+      if (delimited)
+       cpp_warning (pfile, CPP_W_NONE,
+                    "incomplete delimited escape sequence; "
+                    "treating it as separate tokens");
        *cp = 0;
        return false;
      }
--- gcc/testsuite/c-c++-common/cpp/delimited-escape-seq-4.c.jj  2022-09-01 11:46:26.951102634 +0200
+++ gcc/testsuite/c-c++-common/cpp/delimited-escape-seq-4.c     2022-09-01 12:50:17.351810090 +0200
@@ -0,0 +1,13 @@
+/* P2290R3 - Delimited escape sequences */
+/* { dg-do compile } */
+/* { dg-require-effective-target wchar } */
+/* { dg-options "-std=gnu99 -Wno-c++-compat" { target c } } */
+/* { dg-options "-std=gnu++20" { target c++ } } */
+
+#define z(x) 0
+#define a z(
+int b = a\u{});                /* { dg-warning "empty delimited escape sequence; treating it as separate tokens" } */
+int c = a\u{);         /* { dg-warning "incomplete delimited escape sequence; treating it as separate tokens" } */
+int d = a\u{12XYZ});   /* { dg-warning "incomplete delimited escape sequence; treating it as separate tokens" } */
+int e = a\u123);
+int f = a\U1234567);
--- gcc/testsuite/c-c++-common/cpp/delimited-escape-seq-5.c.jj  2022-09-01 11:46:26.951102634 +0200
+++ gcc/testsuite/c-c++-common/cpp/delimited-escape-seq-5.c     2022-09-01 12:57:50.830665924 +0200
@@ -0,0 +1,13 @@
+/* P2290R3 - Delimited escape sequences */
+/* { dg-do compile } */
+/* { dg-require-effective-target wchar } */
+/* { dg-options "-std=c17 -Wno-c++-compat" { target c } } */
+/* { dg-options "-std=c++23" { target c++ } } */
+
+#define z(x) 0
+#define a z(
+int b = a\u{});                /* { dg-warning "empty delimited escape sequence; treating it as separate tokens" "" { target c++23 } } */
+int c = a\u{);         /* { dg-warning "incomplete delimited escape sequence; treating it as separate tokens" "" { target c++23 } } */
+int d = a\u{12XYZ});   /* { dg-warning "incomplete delimited escape sequence; treating it as separate tokens" "" { target c++23 } } */
+int e = a\u123);
+int f = a\U1234567);
--- gcc/testsuite/c-c++-common/cpp/named-universal-char-escape-5.c.jj   2022-09-01 11:46:26.951102634 +0200
+++ gcc/testsuite/c-c++-common/cpp/named-universal-char-escape-5.c      2022-09-01 12:48:01.753647302 +0200
@@ -0,0 +1,15 @@
+/* P2071R2 - Named universal character escapes */
+/* { dg-do compile } */
+/* { dg-require-effective-target wchar } */
+/* { dg-options "-std=gnu99 -Wno-c++-compat" { target c } } */
+/* { dg-options "-std=gnu++20" { target c++ } } */
+
+#define z(x) 0
+#define a z(
+int b = a\N{});                                /* { dg-warning "empty named universal character escape sequence; treating it as separate tokens" } */
+int c = a\N{);                         /* { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" } */
+int d = a\N);
+int e = a\NARG);
+int f = a\N{abc});                     /* { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" } */
+int g = a\N{ABC.123});                 /* { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" } */
+int h = a\N{NON-EXISTENT CHAR});       /* { dg-warning "is not a valid universal character" } */
--- gcc/testsuite/c-c++-common/cpp/named-universal-char-escape-6.c.jj   2022-09-01 11:46:26.951102634 +0200
+++ gcc/testsuite/c-c++-common/cpp/named-universal-char-escape-6.c      2022-09-01 11:46:26.951102634 +0200
@@ -0,0 +1,16 @@
+/* P2071R2 - Named universal character escapes */
+/* { dg-do compile } */
+/* { dg-require-effective-target wchar } */
+/* { dg-options "-std=c17 -Wno-c++-compat" { target c } } */
+/* { dg-options "-std=c++20" { target c++ } } */
+
+#define z(x) 0
+#define a z(
+int b = a\N{});
+int c = a\N{);
+int d = a\N);
+int e = a\NARG);
+int f = a\N{abc});
+int g = a\N{ABC.123});
+int h = a\N{NON-EXISTENT CHAR});       /* { dg-bogus "is not a valid universal character" } */
+int i = a\N{LATIN SMALL CHARACTER A WITH ACUTE});
--- gcc/testsuite/g++.dg/cpp23/named-universal-char-escape1.C.jj        2022-09-01 11:46:26.951102634 +0200
+++ gcc/testsuite/g++.dg/cpp23/named-universal-char-escape1.C   2022-09-01 12:56:48.031516792 +0200
@@ -0,0 +1,14 @@
+// P2071R2 - Named universal character escapes
+// { dg-do compile }
+// { dg-require-effective-target wchar }
+
+#define z(x) 0
+#define a z(
+int b = a\N{});                                // { dg-warning "empty named universal character escape sequence; treating it as separate tokens" "" { target c++23 } }
+int c = a\N{);                         // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" "" { target c++23 } }
+int d = a\N);
+int e = a\NARG);
+int f = a\N{abc});                     // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" "" { target c++23 } }
+int g = a\N{ABC.123});                 // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" "" { target c++23 } }
+int h = a\N{NON-EXISTENT CHAR});       // { dg-error "is not a valid universal character" "" { target c++23 } }
+                                       // { dg-error "was not declared in this scope" "" { target c++23 } .-1 }
--- gcc/testsuite/g++.dg/cpp23/named-universal-char-escape2.C.jj        2022-09-01 12:54:47.436150733 +0200
+++ gcc/testsuite/g++.dg/cpp23/named-universal-char-escape2.C   2022-09-01 12:59:22.512428644 +0200
@@ -0,0 +1,16 @@
+// P2071R2 - Named universal character escapes
+// { dg-do compile }
+// { dg-require-effective-target wchar }
+// { dg-options "" }
+
+#define z(x) 0
+#define a z(
+int b = a\N{});                                // { dg-warning "empty named universal character escape sequence; treating it as separate tokens" }
+int c = a\N{);                         // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" }
+int d = a\N);
+int e = a\NARG);
+int f = a\N{abc});                     // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" }
+int g = a\N{ABC.123});                 // { dg-warning "incomplete named universal character escape sequence; treating it as separate tokens" }
+int h = a\N{NON-EXISTENT CHAR});       // { dg-error "is not a valid universal character" "" { target c++23 } }
+                                       // { dg-error "was not declared in this scope" "" { target c++23 } .-1 }
+                                       // { dg-warning "is not a valid universal character; treating it as separate tokens" "" { target c++20_down } .-2 }


        Jakub
