Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-09-01 Thread Jakub Jelinek via Gcc-patches
On Thu, Sep 01, 2022 at 03:00:28PM -0400, Jason Merrill wrote:
> > Apparently clang uses -Wunicode option to cover these, but unfortunately
> > they don't bother to document it (nor almost any other warning option),
> > so it is unclear what else exactly it covers.  Plus a question is how
> > we should document that option for GCC...
> 
> We might as well use the same flag name, and document it to mean what it
> currently means for GCC.

Ok, will work on that tomorrow.

> > @@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const
> >   if (str < limit && *str == '}')
> > {
> > - if (name == str && identifier_pos)
> > + if (identifier_pos && (name == str || !strict))
> > {
> > + if (name == str)
> > +   cpp_warning (pfile, CPP_W_NONE,
> > +"empty named universal character escape "
> > +"sequence; treating it as separate tokens");
> > + else
> > +   cpp_warning (pfile, CPP_W_NONE,
> > +"incomplete named universal character escape "
> > +"sequence; treating it as separate tokens");
> 
> It looks like this is handling \N{abc}, for which "incomplete" seems like
> the wrong description; it's complete, just wrong, and the diagnostic doesn't
> help correct it.

The point is to make it more consistent with the \N{X.1} handling.
The grammar is clear that only upper case letters + digits + space + hyphen
can appear in between \N{ and }.  So, both of those cases IMHO should be
handled the same.  The !strict case is if there is at least one lower case
letter or underscore but no other characters than letters + digits + space +
hyphen + underscore, we then find the terminating } and inside of
string/character literals want to do the UAX44LM2 algorithm suggestions.
But for X.1 in literals we don't even look for }, we just emit the
  cpp_error (pfile, CPP_DL_ERROR,
 "'\\N{' not terminated with '}' after %.*s",
 (int) (str - base), base);
diagnostics which prints after X
For the identifier_pos case, both the !strict and *str != '}' cases
are the same reason why it is treated as separate tokens, not because
the name is not valid, but because it contains invalid characters.
So perhaps for the identifier_pos !strict and *str != '}' cases
we could emit a warning with the same wording as above (but so that
we stop for !strict on the first lowercase or _ char just break instead
of set strict = true if identifier_pos).
Or we could emit such a warning and a note that would clarify that only
upper case letters, digits, space or hyphen are allowed there?

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-09-01 Thread Jason Merrill via Gcc-patches

On 9/1/22 07:14, Jakub Jelinek wrote:

On Wed, Aug 31, 2022 at 12:14:22PM -0400, Jason Merrill wrote:

On 8/31/22 11:07, Jakub Jelinek wrote:

On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote:

It could be more explicit, but I think we can assume that from the existing
wording; it says it designates the named character.  If there is no such
character, that cannot be satisfied, so it must be ill-formed.


Ok.


So, we could reject the int h case above and accept silently the others?


Why not warn on the others?


We were always silent for the cases like \u123X or \U12345X.
Do you think we should emit some warnings (but never pedwarns/errors in that
case) that it is universal character name like but not completely?


I think that would be helpful, at least for \u{ and \N{.


Ok.


Given what you said above, I think that is what we want for the last 2
for C++23, the question is if it is ok also for C++20/C17 etc. and whether
it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
or not in that case.  We could handle those 2 also differently, just
warn instead of error for the \N{ABC} case if not in C++23 mode when
identifier_pos.


That sounds right.


Here is an incremental version of the patch which will make valid
\u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU
modes before C++23 and split it as separate tokens in ISO modes.


Looks good.


Here is a patch which implements that.
I just wonder if we shouldn't have some warning option that would cover
these warnings, currently one needs to use -w to disable those warnings.

Apparently clang uses -Wunicode option to cover these, but unfortunately
they don't bother to document it (nor almost any other warning option),
so it is unclear what else exactly it covers.  Plus a question is how
we should document that option for GCC...


We might as well use the same flag name, and document it to mean what it 
currently means for GCC.



2022-09-01  Jakub Jelinek  

* charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't
handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}.
In possible identifier contexts, don't emit an error and punt
if \N isn't followed by {, or if \N{} surrounds some lower case
letters or _.  In possible identifier contexts when not C++23, don't
emit an error but warning about unknown character names and treat as
separate tokens.  When treating as separate tokens \u{ or \N{, emit
warnings.

* c-c++-common/cpp/delimited-escape-seq-4.c: New test.
* c-c++-common/cpp/delimited-escape-seq-5.c: New test.
* c-c++-common/cpp/named-universal-char-escape-5.c: New test.
* c-c++-common/cpp/named-universal-char-escape-6.c: New test.
* g++.dg/cpp23/named-universal-char-escape1.C: New test.
* g++.dg/cpp23/named-universal-char-escape2.C: New test.

--- libcpp/charset.cc.jj2022-09-01 09:47:24.146886929 +0200
+++ libcpp/charset.cc   2022-09-01 12:52:28.424034208 +0200
@@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const
if (str[-1] == 'u')
  {
length = 4;
-  if (str < limit && *str == '{')
+  if (str < limit
+ && *str == '{'
+ && (!identifier_pos
+ || CPP_OPTION (pfile, delimited_escape_seqs)
+ || !CPP_OPTION (pfile, std)))
{
  str++;
  /* Magic value to indicate no digits seen.  */
@@ -1462,8 +1466,22 @@ _cpp_valid_ucn (cpp_reader *pfile, const
else if (str[-1] == 'N')
  {
length = 4;
+  if (identifier_pos
+ && !CPP_OPTION (pfile, delimited_escape_seqs)
+ && CPP_OPTION (pfile, std))
+   {
+ *cp = 0;
+ return false;
+   }
if (str == limit || *str != '{')
-   cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   {
+ if (identifier_pos)
+   {
+ *cp = 0;
+ return false;
+   }
+ cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   }
else
{
  str++;
@@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const
  
  	  if (str < limit && *str == '}')

{
- if (name == str && identifier_pos)
+ if (identifier_pos && (name == str || !strict))
{
+ if (name == str)
+   cpp_warning (pfile, CPP_W_NONE,
+"empty named universal character escape "
+"sequence; treating it as separate tokens");
+ else
+   cpp_warning (pfile, CPP_W_NONE,
+"incomplete named universal character escape "
+"sequence; treating it as separate tokens");


It looks like this is handling \N{abc}, for which "incomplete" seems 
like the wrong description; it's complete, just wrong, and the 

Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-09-01 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 31, 2022 at 12:14:22PM -0400, Jason Merrill wrote:
> On 8/31/22 11:07, Jakub Jelinek wrote:
> > On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote:
> > > It could be more explicit, but I think we can assume that from the 
> > > existing
> > > wording; it says it designates the named character.  If there is no such
> > > character, that cannot be satisfied, so it must be ill-formed.
> > 
> > Ok.
> > 
> > > > So, we could reject the int h case above and accept silently the others?
> > > 
> > > Why not warn on the others?
> > 
> > We were always silent for the cases like \u123X or \U12345X.
> > Do you think we should emit some warnings (but never pedwarns/errors in that
> > case) that it is universal character name like but not completely?
> 
> I think that would be helpful, at least for \u{ and \N{.

Ok.

> > Given what you said above, I think that is what we want for the last 2
> > for C++23, the question is if it is ok also for C++20/C17 etc. and whether
> > it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
> > or not in that case.  We could handle those 2 also differently, just
> > warn instead of error for the \N{ABC} case if not in C++23 mode when
> > identifier_pos.
> 
> That sounds right.
> 
> > Here is an incremental version of the patch which will make valid
> > \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU
> > modes before C++23 and split it as separate tokens in ISO modes.
> 
> Looks good.

Here is a patch which implements that.
I just wonder if we shouldn't have some warning option that would cover
these warnings, currently one needs to use -w to disable those warnings.

Apparently clang uses -Wunicode option to cover these, but unfortunately
they don't bother to document it (nor almost any other warning option),
so it is unclear what else exactly it covers.  Plus a question is how
we should document that option for GCC...

2022-09-01  Jakub Jelinek  

* charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't
handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}.
In possible identifier contexts, don't emit an error and punt
if \N isn't followed by {, or if \N{} surrounds some lower case
letters or _.  In possible identifier contexts when not C++23, don't
emit an error but warning about unknown character names and treat as
separate tokens.  When treating as separate tokens \u{ or \N{, emit
warnings.

* c-c++-common/cpp/delimited-escape-seq-4.c: New test.
* c-c++-common/cpp/delimited-escape-seq-5.c: New test.
* c-c++-common/cpp/named-universal-char-escape-5.c: New test.
* c-c++-common/cpp/named-universal-char-escape-6.c: New test.
* g++.dg/cpp23/named-universal-char-escape1.C: New test.
* g++.dg/cpp23/named-universal-char-escape2.C: New test.

--- libcpp/charset.cc.jj2022-09-01 09:47:24.146886929 +0200
+++ libcpp/charset.cc   2022-09-01 12:52:28.424034208 +0200
@@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const
   if (str[-1] == 'u')
 {
   length = 4;
-  if (str < limit && *str == '{')
+  if (str < limit
+ && *str == '{'
+ && (!identifier_pos
+ || CPP_OPTION (pfile, delimited_escape_seqs)
+ || !CPP_OPTION (pfile, std)))
{
  str++;
  /* Magic value to indicate no digits seen.  */
@@ -1462,8 +1466,22 @@ _cpp_valid_ucn (cpp_reader *pfile, const
   else if (str[-1] == 'N')
 {
   length = 4;
+  if (identifier_pos
+ && !CPP_OPTION (pfile, delimited_escape_seqs)
+ && CPP_OPTION (pfile, std))
+   {
+ *cp = 0;
+ return false;
+   }
   if (str == limit || *str != '{')
-   cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   {
+ if (identifier_pos)
+   {
+ *cp = 0;
+ return false;
+   }
+ cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   }
   else
{
  str++;
@@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const
 
  if (str < limit && *str == '}')
{
- if (name == str && identifier_pos)
+ if (identifier_pos && (name == str || !strict))
{
+ if (name == str)
+   cpp_warning (pfile, CPP_W_NONE,
+"empty named universal character escape "
+"sequence; treating it as separate tokens");
+ else
+   cpp_warning (pfile, CPP_W_NONE,
+"incomplete named universal character escape "
+"sequence; treating it as separate tokens");
  *cp = 0;
  return false;
}
@@ -1515,27 +1541,48 @@ _cpp_valid_ucn (cpp_reader *pfile, const
   

Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jason Merrill via Gcc-patches

On 8/31/22 11:07, Jakub Jelinek wrote:

On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote:

It could be more explicit, but I think we can assume that from the existing
wording; it says it designates the named character.  If there is no such
character, that cannot be satisfied, so it must be ill-formed.


Ok.


So, we could reject the int h case above and accept silently the others?


Why not warn on the others?


We were always silent for the cases like \u123X or \U12345X.
Do you think we should emit some warnings (but never pedwarns/errors in that
case) that it is universal character name like but not completely?


I think that would be helpful, at least for \u{ and \N{.


The following patch let's us silently accept:
#define z(x) 0
#define a z(
int b = a\u{});
int c = a\u{);
int d = a\N{});
int e = a\N{);
int f = a\u123);
int g = a\U1234567);
int h = a\N);
int i = a\NARG);
int j = a\N{abc});
int k = a\N{ABC.123});

The following 2 will be still rejected with errors:
int l = a\N{ABC});
int m = a\N{LATIN SMALL LETTER A WITH ACUTE});
the first one because ABC is not valid Unicode name and the latter because
it will be int m = aá); and will trigger other errors later.

Given what you said above, I think that is what we want for the last 2
for C++23, the question is if it is ok also for C++20/C17 etc. and whether
it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
or not in that case.  We could handle those 2 also differently, just
warn instead of error for the \N{ABC} case if not in C++23 mode when
identifier_pos.


That sounds right.


Here is an incremental version of the patch which will make valid
\u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU
modes before C++23 and split it as separate tokens in ISO modes.


Looks good.

Jason



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 31, 2022 at 05:07:29PM +0200, Jakub Jelinek via Gcc-patches wrote:
> Given what you said above, I think that is what we want for the last 2
> for C++23, the question is if it is ok also for C++20/C17 etc. and whether
> it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
> or not in that case.  We could handle those 2 also differently, just
> warn instead of error for the \N{ABC} case if not in C++23 mode when
> identifier_pos.

Here is an incremental version of the patch which will make valid
\u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU
modes before C++23 and split it as separate tokens in ISO modes.

Testcase:
#define z(x) 0
#define a z(
int b = a\u{123});
int c = a\N{LATIN SMALL LETTER A WITH ACUTE});

--- libcpp/charset.cc.jj2022-08-31 16:50:48.862775486 +0200
+++ libcpp/charset.cc   2022-08-31 17:18:59.649257350 +0200
@@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const
   if (str[-1] == 'u')
 {
   length = 4;
-  if (str < limit && *str == '{')
+  if (str < limit
+  && *str == '{'
+  && (!identifier_pos
+ || CPP_OPTION (pfile, delimited_escape_seqs)
+ || !CPP_OPTION (pfile, std)))
{
  str++;
  /* Magic value to indicate no digits seen.  */
@@ -1462,6 +1466,13 @@ _cpp_valid_ucn (cpp_reader *pfile, const
   else if (str[-1] == 'N')
 {
   length = 4;
+  if (identifier_pos
+ && !CPP_OPTION (pfile, delimited_escape_seqs)
+ && CPP_OPTION (pfile, std))
+   {
+ *cp = 0;
+ return false;
+   }
   if (str == limit || *str != '{')
{
  if (identifier_pos)

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote:
> It could be more explicit, but I think we can assume that from the existing
> wording; it says it designates the named character.  If there is no such
> character, that cannot be satisfied, so it must be ill-formed.

Ok.

> > So, we could reject the int h case above and accept silently the others?
> 
> Why not warn on the others?

We were always silent for the cases like \u123X or \U12345X.
Do you think we should emit some warnings (but never pedwarns/errors in that
case) that it is universal character name like but not completely?

The following patch let's us silently accept:
#define z(x) 0
#define a z(
int b = a\u{});
int c = a\u{);
int d = a\N{});
int e = a\N{);
int f = a\u123);
int g = a\U1234567);
int h = a\N);
int i = a\NARG);
int j = a\N{abc});
int k = a\N{ABC.123});

The following 2 will be still rejected with errors:
int l = a\N{ABC});
int m = a\N{LATIN SMALL LETTER A WITH ACUTE});
the first one because ABC is not valid Unicode name and the latter because
it will be int m = aá); and will trigger other errors later.

Given what you said above, I think that is what we want for the last 2
for C++23, the question is if it is ok also for C++20/C17 etc. and whether
it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode
or not in that case.  We could handle those 2 also differently, just
warn instead of error for the \N{ABC} case if not in C++23 mode when
identifier_pos.

--- libcpp/charset.cc.jj2022-08-31 12:34:18.921176118 +0200
+++ libcpp/charset.cc   2022-08-31 16:50:48.862775486 +0200
@@ -1463,7 +1463,14 @@ _cpp_valid_ucn (cpp_reader *pfile, const
 {
   length = 4;
   if (str == limit || *str != '{')
-   cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   {
+ if (identifier_pos)
+   {
+ *cp = 0;
+ return false;
+   }
+ cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'");
+   }
   else
{
  str++;
@@ -1489,7 +1496,7 @@ _cpp_valid_ucn (cpp_reader *pfile, const
 
  if (str < limit && *str == '}')
{
- if (name == str && identifier_pos)
+ if (identifier_pos && (name == str || !strict))
{
  *cp = 0;
  return false;

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jason Merrill via Gcc-patches

On 8/31/22 10:35, Jakub Jelinek wrote:

On Tue, Aug 30, 2022 at 11:37:07PM +0200, Jakub Jelinek via Gcc-patches wrote:

If
#define z(x) 0
#define a z(
int x = a\NARG);
is valid in C and C++ <= 20 then
#define z(x) 0
#define a z(
int x = a\N{LATIN SMALL LETTER A WITH ACUTE});
is too and shall preprocess to int x = 0; too.
Which would likely mean that we want to only handle it in identifiers if
in C++23 and not actually treat it as an extension except in literals.

Jason, your toughts about that?


Trying to read again the current C++23 wording, I'm afraid that outside of
the literals both the delimited escape sequences and named universal
characters are a complete nightmare for diagnostics.
Because the wording is that only valid universal character names are
replaced by their corresponding characters.
Ill-formed is only if the hexadecimal digit sequences are valid but
represent a number that is not a UCS scalar value, or represent a control
character.
So, I think we can't reject even
#define z(x) 0
#define a z(
int b = a\u{});
int c = a\u{);
int d = a\N{});
int e = a\N{);
int f = a\u123);
int g = a\U1234567);
int h = a\N{LATIN SMALL LETTER A WITH ACUT});
Couldn't C++23 say at least that if a valid named-universal-character
doesn't designate any character then the program is ill-formed?


It could be more explicit, but I think we can assume that from the 
existing wording; it says it designates the named character.  If there 
is no such character, that cannot be satisfied, so it must be ill-formed.



So, we could reject the int h case above and accept silently the others?


Why not warn on the others?


GCC 12 accepts all these without warning, current trunk rejects the h
case with error and accepts the rest without a warning, clang trunk
emits warnings on all cases but the last and rejects the h case with
an error.

Jakub





Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jakub Jelinek via Gcc-patches
On Tue, Aug 30, 2022 at 11:37:07PM +0200, Jakub Jelinek via Gcc-patches wrote:
> If
> #define z(x) 0
> #define a z(
> int x = a\NARG);
> is valid in C and C++ <= 20 then
> #define z(x) 0
> #define a z(
> int x = a\N{LATIN SMALL LETTER A WITH ACUTE});
> is too and shall preprocess to int x = 0; too.
> Which would likely mean that we want to only handle it in identifiers if
> in C++23 and not actually treat it as an extension except in literals.
> 
> Jason, your toughts about that?

Trying to read again the current C++23 wording, I'm afraid that outside of
the literals both the delimited escape sequences and named universal
characters are a complete nightmare for diagnostics.
Because the wording is that only valid universal character names are
replaced by their corresponding characters.
Ill-formed is only if the hexadecimal digit sequences are valid but
represent a number that is not a UCS scalar value, or represent a control
character.
So, I think we can't reject even
#define z(x) 0
#define a z(
int b = a\u{});
int c = a\u{);
int d = a\N{});
int e = a\N{);
int f = a\u123);
int g = a\U1234567);
int h = a\N{LATIN SMALL LETTER A WITH ACUT});
Couldn't C++23 say at least that if a valid named-universal-character
doesn't designate any character then the program is ill-formed?
So, we could reject the int h case above and accept silently the others?

GCC 12 accepts all these without warning, current trunk rejects the h
case with error and accepts the rest without a warning, clang trunk
emits warnings on all cases but the last and rejects the h case with
an error.

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-31 Thread Jason Merrill via Gcc-patches

On 8/30/22 17:37, Jakub Jelinek wrote:

On Tue, Aug 30, 2022 at 11:18:20PM +0200, Jakub Jelinek via Gcc-patches wrote:

On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote:

I'm seeing build failures of glibc for powerpc64, as illustrated by the
following C code:

#if 0
\NARG
#endif

(the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef
__ASSEMBLER__).

This shows some problems with this feature - and with delimited escape
sequences - as it affects C.  It's fine to accept it as an extension
inside string and character literals, because \N or \u{...} would be
invalid in the absence of the feature (i.e. the syntax for such literals
fails to match, meaning that the rule about undefined behavior for a
single ' or " as a pp-token applies).  But outside string and character
literals, the usual lexing rules apply, the \ is a pp-token on its own and
the code is valid at the preprocessing level, and with expansion of macros
appearing before or after the \ (e.g. u defined as a macro in the \u{...}
case) it may be valid code at the language level as well.  I don't know
what older C++ versions say about this, but for C this means e.g.

#define z(x) 0
#define a z(
int x = a\NARG);

needs to be accepted as expanding to "int x = 0;", not interpreted as
using the \N feature in an identifier and produce an error.


Thanks, will look at it tomorrow.


If
#define z(x) 0
#define a z(
int x = a\NARG);
is valid in C and C++ <= 20 then
#define z(x) 0
#define a z(
int x = a\N{LATIN SMALL LETTER A WITH ACUTE});
is too and shall preprocess to int x = 0; too.
Which would likely mean that we want to only handle it in identifiers if
in C++23 and not actually treat it as an extension except in literals.

Jason, your thoughts about that?


The problem in glibc is with \N that is not followed by {; it certainly 
seems that we need to not treat that as an ill-formed named universal 
char escape outside of a literal.  I think for an actual \N{...} 
sequence, we should treat it as an extension unless in strict 
conformance mode, kind of like we do with the ext_numeric_numerals flag.


Jason



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-30 Thread Jakub Jelinek via Gcc-patches
On Tue, Aug 30, 2022 at 11:18:20PM +0200, Jakub Jelinek via Gcc-patches wrote:
> On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote:
> > I'm seeing build failures of glibc for powerpc64, as illustrated by the 
> > following C code:
> > 
> > #if 0
> > \NARG
> > #endif
> > 
> > (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef 
> > __ASSEMBLER__).
> > 
> > This shows some problems with this feature - and with delimited escape 
> > sequences - as it affects C.  It's fine to accept it as an extension 
> > inside string and character literals, because \N or \u{...} would be 
> > invalid in the absence of the feature (i.e. the syntax for such literals 
> > fails to match, meaning that the rule about undefined behavior for a 
> > single ' or " as a pp-token applies).  But outside string and character 
> > literals, the usual lexing rules apply, the \ is a pp-token on its own and 
> > the code is valid at the preprocessing level, and with expansion of macros 
> > appearing before or after the \ (e.g. u defined as a macro in the \u{...} 
> > case) it may be valid code at the language level as well.  I don't know 
> > what older C++ versions say about this, but for C this means e.g.
> > 
> > #define z(x) 0
> > #define a z(
> > int x = a\NARG);
> > 
> > needs to be accepted as expanding to "int x = 0;", not interpreted as 
> > using the \N feature in an identifier and produce an error.
> 
> Thanks, will look at it tomorrow.

If
#define z(x) 0
#define a z(
int x = a\NARG);
is valid in C and C++ <= 20 then
#define z(x) 0
#define a z(
int x = a\N{LATIN SMALL LETTER A WITH ACUTE});
is too and shall preprocess to int x = 0; too.
Which would likely mean that we want to only handle it in identifiers if
in C++23 and not actually treat it as an extension except in literals.

Jason, your toughts about that?

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-30 Thread Jakub Jelinek via Gcc-patches
On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote:
> I'm seeing build failures of glibc for powerpc64, as illustrated by the 
> following C code:
> 
> #if 0
> \NARG
> #endif
> 
> (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef 
> __ASSEMBLER__).
> 
> This shows some problems with this feature - and with delimited escape 
> sequences - as it affects C.  It's fine to accept it as an extension 
> inside string and character literals, because \N or \u{...} would be 
> invalid in the absence of the feature (i.e. the syntax for such literals 
> fails to match, meaning that the rule about undefined behavior for a 
> single ' or " as a pp-token applies).  But outside string and character 
> literals, the usual lexing rules apply, the \ is a pp-token on its own and 
> the code is valid at the preprocessing level, and with expansion of macros 
> appearing before or after the \ (e.g. u defined as a macro in the \u{...} 
> case) it may be valid code at the language level as well.  I don't know 
> what older C++ versions say about this, but for C this means e.g.
> 
> #define z(x) 0
> #define a z(
> int x = a\NARG);
> 
> needs to be accepted as expanding to "int x = 0;", not interpreted as 
> using the \N feature in an identifier and produce an error.

Thanks, will look at it tomorrow.

Jakub



Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-30 Thread Joseph Myers
I'm seeing build failures of glibc for powerpc64, as illustrated by the 
following C code:

#if 0
\NARG
#endif

(the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef 
__ASSEMBLER__).

This shows some problems with this feature - and with delimited escape 
sequences - as it affects C.  It's fine to accept it as an extension 
inside string and character literals, because \N or \u{...} would be 
invalid in the absence of the feature (i.e. the syntax for such literals 
fails to match, meaning that the rule about undefined behavior for a 
single ' or " as a pp-token applies).  But outside string and character 
literals, the usual lexing rules apply, the \ is a pp-token on its own and 
the code is valid at the preprocessing level, and with expansion of macros 
appearing before or after the \ (e.g. u defined as a macro in the \u{...} 
case) it may be valid code at the language level as well.  I don't know 
what older C++ versions say about this, but for C this means e.g.

#define z(x) 0
#define a z(
int x = a\NARG);

needs to be accepted as expanding to "int x = 0;", not interpreted as 
using the \N feature in an identifier and produce an error.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-25 Thread Jason Merrill via Gcc-patches

On 8/25/22 04:49, Jakub Jelinek wrote:

On Wed, Aug 24, 2022 at 04:22:17PM -0400, Jason Merrill wrote:

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?


Does the copyright 2005-2022 mean that this code is partly derived from some
other?


Yes, I was lazy and started by copying over makeucnid.cc which also
parses UnicodeData.txt.


Makes sense, please mention that in the ChangeLog.  OK with that change.


In the end, according to diff -upd -U1 make{ucnid,uname2c}.cc, there are
~180 lines in common (out of ~530 lines of makeucnid.cc), out of which is
~80 lines in the two copyrights, most of the rest are just empty lines or
lines with { or } alone, beyond that
  #include 
  #include 
  #include 
  #include 
  #include 
  
  
  #define NUM_CODE_POINTS 0x11

  #define MAX_CODE_POINT 0x10
and
  /* Read UnicodeData.txt and fill in the 'decomp' table to be the
 decompositions of characters for which both the character
 decomposed and all the code points in the decomposition are valid
 for some supported language version, and the 'all_decomp' table to
 be the decompositions of all characters without those
 constraints.  */
  
  static void

  {
if (!f)
for (;;)
  {
char line[256];
char *l;
  
if (!fgets (line, sizeof (line), f))

 break;
codepoint = strtoul (line, , 16);
if (l == line || *l != ';')
if (codepoint > MAX_CODE_POINT)
  
do {

} while (*l != ';');
 {
 }
  }
if (ferror (f))
fclose (f);
  }
are the common lines close to each other (and whole
write_copyright function).  Dunno if with that I could use
just 2022 copyright or not.


+ /* We don't know what the next letter will be.
+It could be ISALNUM, then we are supposed
+to omit it, or it could be a space and then
+we should not omit it and need to compare it.
+Fortunately the only 3 names with hyphen
+followed by non-letter are
+U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
+U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN
+U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
+and makeuname2c.cc verifies this.
+Furthermore, prefixes of NR2 generated
+ranges all end with a hyphen, but the generated
+part is then followed by alpha-numeric.
+So, let's just assume that - at the end of
+key is always followed by alphanumeric and
+so should be omitted.  */


Let's mention that makeuname2c.cc verifies this property.


I had "and makeuname2c.cc verifies this." there already a few lines before,
but I agree it is better to move that to the end.


+ for (j = start; j < end; j++)
+   {
+ /* Actually strlen, but we know strlen () <= 3.  */


Is this comment saying that you're using a loop instead of calling strlen
because you know the result will be small?  That seems an odd choice.


Yes, but perhaps it is a micro-optimization and maybe the Korean characters
will not be used that much that it isn't worth it.
Our optimizers certainly aren't able to figure out that when
strlen is called on an array element with size 4 that calling library
function isn't the best idea.  The string lengths are 0 in 3%, 1 in 44%,
2 in 47% and 3 in 6% of cases.
At least on x86_64 when I just use this_len = strlen (hangul_syllables[j]);
it calls the library routine.
Changed to this_len = strlen (hangul_syllables[j]);


+ /* Try to do a loose name lookup according to
+Unicode loose matching rule UAX44-LM2.


Maybe factor the loose lookup into a separate function?


Good idea.


+ bidi::kind kind;
+ if (buffer->cur[-1] == 'N')
+   kind = get_bidi_named (pfile, buffer->cur, );
+ else
+   kind = get_bidi_ucn (pfile, buffer->cur,
+buffer->cur[-1] == 'U', );


Hmm, I'm surprised that we're doing bidi checking before replacing escape
characters with elements of the translation character set.  So now we need
to check it three different ways.


It is unfortunate, but I'm afraid it is intentional.
Because after replacing the escape characters we lose the distinction
between characters written as UTF-8 in the source and the escape sequences.
The former need to be treated differently as they are more dangerous than
the latter, bidi written as UTF-8 can mislead what the source contains
already in (some) text editors or whatever way user looks at the source
code, while when written as UCNs (\u, \u{}, \U, \N{}) it can be 

[PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]

2022-08-25 Thread Jakub Jelinek via Gcc-patches
On Wed, Aug 24, 2022 at 04:22:17PM -0400, Jason Merrill wrote:
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> Does the copyright 2005-2022 mean that this code is partly derived from some
> other?

Yes, I was lazy and started by copying over makeucnid.cc which also
parses UnicodeData.txt.
In the end, according to diff -upd -U1 make{ucnid,uname2c}.cc, there are
~180 lines in common (out of ~530 lines of makeucnid.cc), out of which is
~80 lines in the two copyrights, most of the rest are just empty lines or
lines with { or } alone, beyond that
 #include 
 #include 
 #include 
 #include 
 #include 
 
 
 #define NUM_CODE_POINTS 0x11
 #define MAX_CODE_POINT 0x10
and
 /* Read UnicodeData.txt and fill in the 'decomp' table to be the
decompositions of characters for which both the character
decomposed and all the code points in the decomposition are valid
for some supported language version, and the 'all_decomp' table to
be the decompositions of all characters without those
constraints.  */
 
 static void
 {
   if (!f)
   for (;;)
 {
   char line[256];
   char *l;
 
   if (!fgets (line, sizeof (line), f))
break;
   codepoint = strtoul (line, , 16);
   if (l == line || *l != ';')
   if (codepoint > MAX_CODE_POINT)
 
   do {
   } while (*l != ';');
{
}
 }
   if (ferror (f))
   fclose (f);
 }
are the common lines close to each other (and whole
write_copyright function).  Dunno if with that I could use
just 2022 copyright or not.

> > + /* We don't know what the next letter will be.
> > +It could be ISALNUM, then we are supposed
> > +to omit it, or it could be a space and then
> > +we should not omit it and need to compare it.
> > +Fortunately the only 3 names with hyphen
> > +followed by non-letter are
> > +U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
> > +U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN
> > +U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
> > +and makeuname2c.cc verifies this.
> > +Furthermore, prefixes of NR2 generated
> > +ranges all end with a hyphen, but the generated
> > +part is then followed by alpha-numeric.
> > +So, let's just assume that - at the end of
> > +key is always followed by alphanumeric and
> > +so should be omitted.  */
> 
> Let's mention that makeuname2c.cc verifies this property.

I had "and makeuname2c.cc verifies this." there already a few lines before,
but I agree it is better to move that to the end.

> > + for (j = start; j < end; j++)
> > +   {
> > + /* Actually strlen, but we know strlen () <= 3.  */
> 
> Is this comment saying that you're using a loop instead of calling strlen
> because you know the result will be small?  That seems an odd choice.

Yes, but perhaps it is a micro-optimization and maybe the Korean characters
will not be used that much that it isn't worth it.
Our optimizers certainly aren't able to figure out that when
strlen is called on an array element with size 4 that calling library
function isn't the best idea.  The string lengths are 0 in 3%, 1 in 44%,
2 in 47% and 3 in 6% of cases.
At least on x86_64 when I just use this_len = strlen (hangul_syllables[j]);
it calls the library routine.
Changed to this_len = strlen (hangul_syllables[j]);

> > + /* Try to do a loose name lookup according to
> > +Unicode loose matching rule UAX44-LM2.
> 
> Maybe factor the loose lookup into a separate function?

Good idea.

> > + bidi::kind kind;
> > + if (buffer->cur[-1] == 'N')
> > +   kind = get_bidi_named (pfile, buffer->cur, );
> > + else
> > +   kind = get_bidi_ucn (pfile, buffer->cur,
> > +buffer->cur[-1] == 'U', );
> 
> Hmm, I'm surprised that we're doing bidi checking before replacing escape
> characters with elements of the translation character set.  So now we need
> to check it three different ways.

It is unfortunate, but I'm afraid it is intentional.
Because after replacing the escape characters we lose the distinction
between characters written as UTF-8 in the source and the escape sequences.
The former need to be treated differently as they are more dangerous than
the latter, bidi written as UTF-8 can mislead what the source contains
already in (some) text editors or whatever way user looks at the source
code, while when written as UCNs (\u, \u{}, \U, \N{}) it can be dangerous
only when the program emits it at runtime unpaired.

Here is incremental diff and full patch (with the huge uname2c.h generated
header