Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Thu, Sep 01, 2022 at 03:00:28PM -0400, Jason Merrill wrote: > > Apparently clang uses -Wunicode option to cover these, but unfortunately > > they don't bother to document it (nor almost any other warning option), > > so it is unclear what else exactly it covers. Plus a question is how > > we should document that option for GCC... > > We might as well use the same flag name, and document it to mean what it > currently means for GCC. Ok, will work on that tomorrow. > > @@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const > > if (str < limit && *str == '}') > > { > > - if (name == str && identifier_pos) > > + if (identifier_pos && (name == str || !strict)) > > { > > + if (name == str) > > + cpp_warning (pfile, CPP_W_NONE, > > +"empty named universal character escape " > > +"sequence; treating it as separate tokens"); > > + else > > + cpp_warning (pfile, CPP_W_NONE, > > +"incomplete named universal character escape " > > +"sequence; treating it as separate tokens"); > > It looks like this is handling \N{abc}, for which "incomplete" seems like > the wrong description; it's complete, just wrong, and the diagnostic doesn't > help correct it. The point is to make it more consistent with the \N{X.1} handling. The grammar is clear that only upper case letters + digits + space + hyphen can appear in between \N{ and }. So, both of those cases IMHO should be handled the same. The !strict case is if there is at least one lower case letter or underscore but no other characters than letters + digits + space + hyphen + underscore, we then find the terminating } and inside of string/character literals want to do the UAX44LM2 algorithm suggestions. But for X.1 in literals we don't even look for }, we just emit the cpp_error (pfile, CPP_DL_ERROR, "'\\N{' not terminated with '}' after %.*s", (int) (str - base), base); diagnostics which prints after X For the identifier_pos case, both the !strict and *str != '}' cases are the same reason why it is treated as separate tokens, not because the name is not valid, but because it contains invalid characters. So perhaps for the identifier_pos !strict and *str != '}' cases we could emit a warning with the same wording as above (but so that we stop for !strict on the first lowercase or _ char just break instead of set strict = true if identifier_pos). Or we could emit such a warning and a note that would clarify that only upper case letters, digits, space or hyphen are allowed there? Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On 9/1/22 07:14, Jakub Jelinek wrote: On Wed, Aug 31, 2022 at 12:14:22PM -0400, Jason Merrill wrote: On 8/31/22 11:07, Jakub Jelinek wrote: On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote: It could be more explicit, but I think we can assume that from the existing wording; it says it designates the named character. If there is no such character, that cannot be satisfied, so it must be ill-formed. Ok. So, we could reject the int h case above and accept silently the others? Why not warn on the others? We were always silent for the cases like \u123X or \U12345X. Do you think we should emit some warnings (but never pedwarns/errors in that case) that it is universal character name like but not completely? I think that would be helpful, at least for \u{ and \N{. Ok. Given what you said above, I think that is what we want for the last 2 for C++23, the question is if it is ok also for C++20/C17 etc. and whether it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode or not in that case. We could handle those 2 also differently, just warn instead of error for the \N{ABC} case if not in C++23 mode when identifier_pos. That sounds right. Here is an incremental version of the patch which will make valid \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU modes before C++23 and split it as separate tokens in ISO modes. Looks good. Here is a patch which implements that. I just wonder if we shouldn't have some warning option that would cover these warnings, currently one needs to use -w to disable those warnings. Apparently clang uses -Wunicode option to cover these, but unfortunately they don't bother to document it (nor almost any other warning option), so it is unclear what else exactly it covers. Plus a question is how we should document that option for GCC... We might as well use the same flag name, and document it to mean what it currently means for GCC. 2022-09-01 Jakub Jelinek * charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}. In possible identifier contexts, don't emit an error and punt if \N isn't followed by {, or if \N{} surrounds some lower case letters or _. In possible identifier contexts when not C++23, don't emit an error but warning about unknown character names and treat as separate tokens. When treating as separate tokens \u{ or \N{, emit warnings. * c-c++-common/cpp/delimited-escape-seq-4.c: New test. * c-c++-common/cpp/delimited-escape-seq-5.c: New test. * c-c++-common/cpp/named-universal-char-escape-5.c: New test. * c-c++-common/cpp/named-universal-char-escape-6.c: New test. * g++.dg/cpp23/named-universal-char-escape1.C: New test. * g++.dg/cpp23/named-universal-char-escape2.C: New test. --- libcpp/charset.cc.jj2022-09-01 09:47:24.146886929 +0200 +++ libcpp/charset.cc 2022-09-01 12:52:28.424034208 +0200 @@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str[-1] == 'u') { length = 4; - if (str < limit && *str == '{') + if (str < limit + && *str == '{' + && (!identifier_pos + || CPP_OPTION (pfile, delimited_escape_seqs) + || !CPP_OPTION (pfile, std))) { str++; /* Magic value to indicate no digits seen. */ @@ -1462,8 +1466,22 @@ _cpp_valid_ucn (cpp_reader *pfile, const else if (str[-1] == 'N') { length = 4; + if (identifier_pos + && !CPP_OPTION (pfile, delimited_escape_seqs) + && CPP_OPTION (pfile, std)) + { + *cp = 0; + return false; + } if (str == limit || *str != '{') - cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + { + if (identifier_pos) + { + *cp = 0; + return false; + } + cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + } else { str++; @@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str < limit && *str == '}') { - if (name == str && identifier_pos) + if (identifier_pos && (name == str || !strict)) { + if (name == str) + cpp_warning (pfile, CPP_W_NONE, +"empty named universal character escape " +"sequence; treating it as separate tokens"); + else + cpp_warning (pfile, CPP_W_NONE, +"incomplete named universal character escape " +"sequence; treating it as separate tokens"); It looks like this is handling \N{abc}, for which "incomplete" seems like the wrong description; it's complete, just wrong, and the
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Wed, Aug 31, 2022 at 12:14:22PM -0400, Jason Merrill wrote: > On 8/31/22 11:07, Jakub Jelinek wrote: > > On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote: > > > It could be more explicit, but I think we can assume that from the > > > existing > > > wording; it says it designates the named character. If there is no such > > > character, that cannot be satisfied, so it must be ill-formed. > > > > Ok. > > > > > > So, we could reject the int h case above and accept silently the others? > > > > > > Why not warn on the others? > > > > We were always silent for the cases like \u123X or \U12345X. > > Do you think we should emit some warnings (but never pedwarns/errors in that > > case) that it is universal character name like but not completely? > > I think that would be helpful, at least for \u{ and \N{. Ok. > > Given what you said above, I think that is what we want for the last 2 > > for C++23, the question is if it is ok also for C++20/C17 etc. and whether > > it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode > > or not in that case. We could handle those 2 also differently, just > > warn instead of error for the \N{ABC} case if not in C++23 mode when > > identifier_pos. > > That sounds right. > > > Here is an incremental version of the patch which will make valid > > \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU > > modes before C++23 and split it as separate tokens in ISO modes. > > Looks good. Here is a patch which implements that. I just wonder if we shouldn't have some warning option that would cover these warnings, currently one needs to use -w to disable those warnings. Apparently clang uses -Wunicode option to cover these, but unfortunately they don't bother to document it (nor almost any other warning option), so it is unclear what else exactly it covers. Plus a question is how we should document that option for GCC... 2022-09-01 Jakub Jelinek * charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}. In possible identifier contexts, don't emit an error and punt if \N isn't followed by {, or if \N{} surrounds some lower case letters or _. In possible identifier contexts when not C++23, don't emit an error but warning about unknown character names and treat as separate tokens. When treating as separate tokens \u{ or \N{, emit warnings. * c-c++-common/cpp/delimited-escape-seq-4.c: New test. * c-c++-common/cpp/delimited-escape-seq-5.c: New test. * c-c++-common/cpp/named-universal-char-escape-5.c: New test. * c-c++-common/cpp/named-universal-char-escape-6.c: New test. * g++.dg/cpp23/named-universal-char-escape1.C: New test. * g++.dg/cpp23/named-universal-char-escape2.C: New test. --- libcpp/charset.cc.jj2022-09-01 09:47:24.146886929 +0200 +++ libcpp/charset.cc 2022-09-01 12:52:28.424034208 +0200 @@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str[-1] == 'u') { length = 4; - if (str < limit && *str == '{') + if (str < limit + && *str == '{' + && (!identifier_pos + || CPP_OPTION (pfile, delimited_escape_seqs) + || !CPP_OPTION (pfile, std))) { str++; /* Magic value to indicate no digits seen. */ @@ -1462,8 +1466,22 @@ _cpp_valid_ucn (cpp_reader *pfile, const else if (str[-1] == 'N') { length = 4; + if (identifier_pos + && !CPP_OPTION (pfile, delimited_escape_seqs) + && CPP_OPTION (pfile, std)) + { + *cp = 0; + return false; + } if (str == limit || *str != '{') - cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + { + if (identifier_pos) + { + *cp = 0; + return false; + } + cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + } else { str++; @@ -1489,8 +1507,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str < limit && *str == '}') { - if (name == str && identifier_pos) + if (identifier_pos && (name == str || !strict)) { + if (name == str) + cpp_warning (pfile, CPP_W_NONE, +"empty named universal character escape " +"sequence; treating it as separate tokens"); + else + cpp_warning (pfile, CPP_W_NONE, +"incomplete named universal character escape " +"sequence; treating it as separate tokens"); *cp = 0; return false; } @@ -1515,27 +1541,48 @@ _cpp_valid_ucn (cpp_reader *pfile, const
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On 8/31/22 11:07, Jakub Jelinek wrote: On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote: It could be more explicit, but I think we can assume that from the existing wording; it says it designates the named character. If there is no such character, that cannot be satisfied, so it must be ill-formed. Ok. So, we could reject the int h case above and accept silently the others? Why not warn on the others? We were always silent for the cases like \u123X or \U12345X. Do you think we should emit some warnings (but never pedwarns/errors in that case) that it is universal character name like but not completely? I think that would be helpful, at least for \u{ and \N{. The following patch let's us silently accept: #define z(x) 0 #define a z( int b = a\u{}); int c = a\u{); int d = a\N{}); int e = a\N{); int f = a\u123); int g = a\U1234567); int h = a\N); int i = a\NARG); int j = a\N{abc}); int k = a\N{ABC.123}); The following 2 will be still rejected with errors: int l = a\N{ABC}); int m = a\N{LATIN SMALL LETTER A WITH ACUTE}); the first one because ABC is not valid Unicode name and the latter because it will be int m = aá); and will trigger other errors later. Given what you said above, I think that is what we want for the last 2 for C++23, the question is if it is ok also for C++20/C17 etc. and whether it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode or not in that case. We could handle those 2 also differently, just warn instead of error for the \N{ABC} case if not in C++23 mode when identifier_pos. That sounds right. Here is an incremental version of the patch which will make valid \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU modes before C++23 and split it as separate tokens in ISO modes. Looks good. Jason
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Wed, Aug 31, 2022 at 05:07:29PM +0200, Jakub Jelinek via Gcc-patches wrote: > Given what you said above, I think that is what we want for the last 2 > for C++23, the question is if it is ok also for C++20/C17 etc. and whether > it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode > or not in that case. We could handle those 2 also differently, just > warn instead of error for the \N{ABC} case if not in C++23 mode when > identifier_pos. Here is an incremental version of the patch which will make valid \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} an extension in GNU modes before C++23 and split it as separate tokens in ISO modes. Testcase: #define z(x) 0 #define a z( int b = a\u{123}); int c = a\N{LATIN SMALL LETTER A WITH ACUTE}); --- libcpp/charset.cc.jj2022-08-31 16:50:48.862775486 +0200 +++ libcpp/charset.cc 2022-08-31 17:18:59.649257350 +0200 @@ -1448,7 +1448,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str[-1] == 'u') { length = 4; - if (str < limit && *str == '{') + if (str < limit + && *str == '{' + && (!identifier_pos + || CPP_OPTION (pfile, delimited_escape_seqs) + || !CPP_OPTION (pfile, std))) { str++; /* Magic value to indicate no digits seen. */ @@ -1462,6 +1466,13 @@ _cpp_valid_ucn (cpp_reader *pfile, const else if (str[-1] == 'N') { length = 4; + if (identifier_pos + && !CPP_OPTION (pfile, delimited_escape_seqs) + && CPP_OPTION (pfile, std)) + { + *cp = 0; + return false; + } if (str == limit || *str != '{') { if (identifier_pos) Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Wed, Aug 31, 2022 at 10:52:49AM -0400, Jason Merrill wrote: > It could be more explicit, but I think we can assume that from the existing > wording; it says it designates the named character. If there is no such > character, that cannot be satisfied, so it must be ill-formed. Ok. > > So, we could reject the int h case above and accept silently the others? > > Why not warn on the others? We were always silent for the cases like \u123X or \U12345X. Do you think we should emit some warnings (but never pedwarns/errors in that case) that it is universal character name like but not completely? The following patch let's us silently accept: #define z(x) 0 #define a z( int b = a\u{}); int c = a\u{); int d = a\N{}); int e = a\N{); int f = a\u123); int g = a\U1234567); int h = a\N); int i = a\NARG); int j = a\N{abc}); int k = a\N{ABC.123}); The following 2 will be still rejected with errors: int l = a\N{ABC}); int m = a\N{LATIN SMALL LETTER A WITH ACUTE}); the first one because ABC is not valid Unicode name and the latter because it will be int m = aá); and will trigger other errors later. Given what you said above, I think that is what we want for the last 2 for C++23, the question is if it is ok also for C++20/C17 etc. and whether it should depend on -pedantic or -pedantic-errors or GNU vs. ISO mode or not in that case. We could handle those 2 also differently, just warn instead of error for the \N{ABC} case if not in C++23 mode when identifier_pos. --- libcpp/charset.cc.jj2022-08-31 12:34:18.921176118 +0200 +++ libcpp/charset.cc 2022-08-31 16:50:48.862775486 +0200 @@ -1463,7 +1463,14 @@ _cpp_valid_ucn (cpp_reader *pfile, const { length = 4; if (str == limit || *str != '{') - cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + { + if (identifier_pos) + { + *cp = 0; + return false; + } + cpp_error (pfile, CPP_DL_ERROR, "'\\N' not followed by '{'"); + } else { str++; @@ -1489,7 +1496,7 @@ _cpp_valid_ucn (cpp_reader *pfile, const if (str < limit && *str == '}') { - if (name == str && identifier_pos) + if (identifier_pos && (name == str || !strict)) { *cp = 0; return false; Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On 8/31/22 10:35, Jakub Jelinek wrote: On Tue, Aug 30, 2022 at 11:37:07PM +0200, Jakub Jelinek via Gcc-patches wrote: If #define z(x) 0 #define a z( int x = a\NARG); is valid in C and C++ <= 20 then #define z(x) 0 #define a z( int x = a\N{LATIN SMALL LETTER A WITH ACUTE}); is too and shall preprocess to int x = 0; too. Which would likely mean that we want to only handle it in identifiers if in C++23 and not actually treat it as an extension except in literals. Jason, your toughts about that? Trying to read again the current C++23 wording, I'm afraid that outside of the literals both the delimited escape sequences and named universal characters are a complete nightmare for diagnostics. Because the wording is that only valid universal character names are replaced by their corresponding characters. Ill-formed is only if the hexadecimal digit sequences are valid but represent a number that is not a UCS scalar value, or represent a control character. So, I think we can't reject even #define z(x) 0 #define a z( int b = a\u{}); int c = a\u{); int d = a\N{}); int e = a\N{); int f = a\u123); int g = a\U1234567); int h = a\N{LATIN SMALL LETTER A WITH ACUT}); Couldn't C++23 say at least that if a valid named-universal-character doesn't designate any character then the program is ill-formed? It could be more explicit, but I think we can assume that from the existing wording; it says it designates the named character. If there is no such character, that cannot be satisfied, so it must be ill-formed. So, we could reject the int h case above and accept silently the others? Why not warn on the others? GCC 12 accepts all these without warning, current trunk rejects the h case with error and accepts the rest without a warning, clang trunk emits warnings on all cases but the last and rejects the h case with an error. Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Tue, Aug 30, 2022 at 11:37:07PM +0200, Jakub Jelinek via Gcc-patches wrote: > If > #define z(x) 0 > #define a z( > int x = a\NARG); > is valid in C and C++ <= 20 then > #define z(x) 0 > #define a z( > int x = a\N{LATIN SMALL LETTER A WITH ACUTE}); > is too and shall preprocess to int x = 0; too. > Which would likely mean that we want to only handle it in identifiers if > in C++23 and not actually treat it as an extension except in literals. > > Jason, your toughts about that? Trying to read again the current C++23 wording, I'm afraid that outside of the literals both the delimited escape sequences and named universal characters are a complete nightmare for diagnostics. Because the wording is that only valid universal character names are replaced by their corresponding characters. Ill-formed is only if the hexadecimal digit sequences are valid but represent a number that is not a UCS scalar value, or represent a control character. So, I think we can't reject even #define z(x) 0 #define a z( int b = a\u{}); int c = a\u{); int d = a\N{}); int e = a\N{); int f = a\u123); int g = a\U1234567); int h = a\N{LATIN SMALL LETTER A WITH ACUT}); Couldn't C++23 say at least that if a valid named-universal-character doesn't designate any character then the program is ill-formed? So, we could reject the int h case above and accept silently the others? GCC 12 accepts all these without warning, current trunk rejects the h case with error and accepts the rest without a warning, clang trunk emits warnings on all cases but the last and rejects the h case with an error. Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On 8/30/22 17:37, Jakub Jelinek wrote: On Tue, Aug 30, 2022 at 11:18:20PM +0200, Jakub Jelinek via Gcc-patches wrote: On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote: I'm seeing build failures of glibc for powerpc64, as illustrated by the following C code: #if 0 \NARG #endif (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef __ASSEMBLER__). This shows some problems with this feature - and with delimited escape sequences - as it affects C. It's fine to accept it as an extension inside string and character literals, because \N or \u{...} would be invalid in the absence of the feature (i.e. the syntax for such literals fails to match, meaning that the rule about undefined behavior for a single ' or " as a pp-token applies). But outside string and character literals, the usual lexing rules apply, the \ is a pp-token on its own and the code is valid at the preprocessing level, and with expansion of macros appearing before or after the \ (e.g. u defined as a macro in the \u{...} case) it may be valid code at the language level as well. I don't know what older C++ versions say about this, but for C this means e.g. #define z(x) 0 #define a z( int x = a\NARG); needs to be accepted as expanding to "int x = 0;", not interpreted as using the \N feature in an identifier and produce an error. Thanks, will look at it tomorrow. If #define z(x) 0 #define a z( int x = a\NARG); is valid in C and C++ <= 20 then #define z(x) 0 #define a z( int x = a\N{LATIN SMALL LETTER A WITH ACUTE}); is too and shall preprocess to int x = 0; too. Which would likely mean that we want to only handle it in identifiers if in C++23 and not actually treat it as an extension except in literals. Jason, your thoughts about that? The problem in glibc is with \N that is not followed by {; it certainly seems that we need to not treat that as an ill-formed named universal char escape outside of a literal. I think for an actual \N{...} sequence, we should treat it as an extension unless in strict conformance mode, kind of like we do with the ext_numeric_numerals flag. Jason
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Tue, Aug 30, 2022 at 11:18:20PM +0200, Jakub Jelinek via Gcc-patches wrote: > On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote: > > I'm seeing build failures of glibc for powerpc64, as illustrated by the > > following C code: > > > > #if 0 > > \NARG > > #endif > > > > (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef > > __ASSEMBLER__). > > > > This shows some problems with this feature - and with delimited escape > > sequences - as it affects C. It's fine to accept it as an extension > > inside string and character literals, because \N or \u{...} would be > > invalid in the absence of the feature (i.e. the syntax for such literals > > fails to match, meaning that the rule about undefined behavior for a > > single ' or " as a pp-token applies). But outside string and character > > literals, the usual lexing rules apply, the \ is a pp-token on its own and > > the code is valid at the preprocessing level, and with expansion of macros > > appearing before or after the \ (e.g. u defined as a macro in the \u{...} > > case) it may be valid code at the language level as well. I don't know > > what older C++ versions say about this, but for C this means e.g. > > > > #define z(x) 0 > > #define a z( > > int x = a\NARG); > > > > needs to be accepted as expanding to "int x = 0;", not interpreted as > > using the \N feature in an identifier and produce an error. > > Thanks, will look at it tomorrow. If #define z(x) 0 #define a z( int x = a\NARG); is valid in C and C++ <= 20 then #define z(x) 0 #define a z( int x = a\N{LATIN SMALL LETTER A WITH ACUTE}); is too and shall preprocess to int x = 0; too. Which would likely mean that we want to only handle it in identifiers if in C++23 and not actually treat it as an extension except in literals. Jason, your toughts about that? Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Tue, Aug 30, 2022 at 09:10:37PM +, Joseph Myers wrote: > I'm seeing build failures of glibc for powerpc64, as illustrated by the > following C code: > > #if 0 > \NARG > #endif > > (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef > __ASSEMBLER__). > > This shows some problems with this feature - and with delimited escape > sequences - as it affects C. It's fine to accept it as an extension > inside string and character literals, because \N or \u{...} would be > invalid in the absence of the feature (i.e. the syntax for such literals > fails to match, meaning that the rule about undefined behavior for a > single ' or " as a pp-token applies). But outside string and character > literals, the usual lexing rules apply, the \ is a pp-token on its own and > the code is valid at the preprocessing level, and with expansion of macros > appearing before or after the \ (e.g. u defined as a macro in the \u{...} > case) it may be valid code at the language level as well. I don't know > what older C++ versions say about this, but for C this means e.g. > > #define z(x) 0 > #define a z( > int x = a\NARG); > > needs to be accepted as expanding to "int x = 0;", not interpreted as > using the \N feature in an identifier and produce an error. Thanks, will look at it tomorrow. Jakub
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
I'm seeing build failures of glibc for powerpc64, as illustrated by the following C code: #if 0 \NARG #endif (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef __ASSEMBLER__). This shows some problems with this feature - and with delimited escape sequences - as it affects C. It's fine to accept it as an extension inside string and character literals, because \N or \u{...} would be invalid in the absence of the feature (i.e. the syntax for such literals fails to match, meaning that the rule about undefined behavior for a single ' or " as a pp-token applies). But outside string and character literals, the usual lexing rules apply, the \ is a pp-token on its own and the code is valid at the preprocessing level, and with expansion of macros appearing before or after the \ (e.g. u defined as a macro in the \u{...} case) it may be valid code at the language level as well. I don't know what older C++ versions say about this, but for C this means e.g. #define z(x) 0 #define a z( int x = a\NARG); needs to be accepted as expanding to "int x = 0;", not interpreted as using the \N feature in an identifier and produce an error. -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On 8/25/22 04:49, Jakub Jelinek wrote: On Wed, Aug 24, 2022 at 04:22:17PM -0400, Jason Merrill wrote: Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Does the copyright 2005-2022 mean that this code is partly derived from some other? Yes, I was lazy and started by copying over makeucnid.cc which also parses UnicodeData.txt. Makes sense, please mention that in the ChangeLog. OK with that change. In the end, according to diff -upd -U1 make{ucnid,uname2c}.cc, there are ~180 lines in common (out of ~530 lines of makeucnid.cc), out of which is ~80 lines in the two copyrights, most of the rest are just empty lines or lines with { or } alone, beyond that #include #include #include #include #include #define NUM_CODE_POINTS 0x11 #define MAX_CODE_POINT 0x10 and /* Read UnicodeData.txt and fill in the 'decomp' table to be the decompositions of characters for which both the character decomposed and all the code points in the decomposition are valid for some supported language version, and the 'all_decomp' table to be the decompositions of all characters without those constraints. */ static void { if (!f) for (;;) { char line[256]; char *l; if (!fgets (line, sizeof (line), f)) break; codepoint = strtoul (line, , 16); if (l == line || *l != ';') if (codepoint > MAX_CODE_POINT) do { } while (*l != ';'); { } } if (ferror (f)) fclose (f); } are the common lines close to each other (and whole write_copyright function). Dunno if with that I could use just 2022 copyright or not. + /* We don't know what the next letter will be. +It could be ISALNUM, then we are supposed +to omit it, or it could be a space and then +we should not omit it and need to compare it. +Fortunately the only 3 names with hyphen +followed by non-letter are +U+0F0A TIBETAN MARK BKA- SHOG YIG MGO +U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN +U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN +and makeuname2c.cc verifies this. +Furthermore, prefixes of NR2 generated +ranges all end with a hyphen, but the generated +part is then followed by alpha-numeric. +So, let's just assume that - at the end of +key is always followed by alphanumeric and +so should be omitted. */ Let's mention that makeuname2c.cc verifies this property. I had "and makeuname2c.cc verifies this." there already a few lines before, but I agree it is better to move that to the end. + for (j = start; j < end; j++) + { + /* Actually strlen, but we know strlen () <= 3. */ Is this comment saying that you're using a loop instead of calling strlen because you know the result will be small? That seems an odd choice. Yes, but perhaps it is a micro-optimization and maybe the Korean characters will not be used that much that it isn't worth it. Our optimizers certainly aren't able to figure out that when strlen is called on an array element with size 4 that calling library function isn't the best idea. The string lengths are 0 in 3%, 1 in 44%, 2 in 47% and 3 in 6% of cases. At least on x86_64 when I just use this_len = strlen (hangul_syllables[j]); it calls the library routine. Changed to this_len = strlen (hangul_syllables[j]); + /* Try to do a loose name lookup according to +Unicode loose matching rule UAX44-LM2. Maybe factor the loose lookup into a separate function? Good idea. + bidi::kind kind; + if (buffer->cur[-1] == 'N') + kind = get_bidi_named (pfile, buffer->cur, ); + else + kind = get_bidi_ucn (pfile, buffer->cur, +buffer->cur[-1] == 'U', ); Hmm, I'm surprised that we're doing bidi checking before replacing escape characters with elements of the translation character set. So now we need to check it three different ways. It is unfortunate, but I'm afraid it is intentional. Because after replacing the escape characters we lose the distinction between characters written as UTF-8 in the source and the escape sequences. The former need to be treated differently as they are more dangerous than the latter, bidi written as UTF-8 can mislead what the source contains already in (some) text editors or whatever way user looks at the source code, while when written as UCNs (\u, \u{}, \U, \N{}) it can be
[PATCH] c++, v2: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
On Wed, Aug 24, 2022 at 04:22:17PM -0400, Jason Merrill wrote: > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > Does the copyright 2005-2022 mean that this code is partly derived from some > other? Yes, I was lazy and started by copying over makeucnid.cc which also parses UnicodeData.txt. In the end, according to diff -upd -U1 make{ucnid,uname2c}.cc, there are ~180 lines in common (out of ~530 lines of makeucnid.cc), out of which is ~80 lines in the two copyrights, most of the rest are just empty lines or lines with { or } alone, beyond that #include #include #include #include #include #define NUM_CODE_POINTS 0x11 #define MAX_CODE_POINT 0x10 and /* Read UnicodeData.txt and fill in the 'decomp' table to be the decompositions of characters for which both the character decomposed and all the code points in the decomposition are valid for some supported language version, and the 'all_decomp' table to be the decompositions of all characters without those constraints. */ static void { if (!f) for (;;) { char line[256]; char *l; if (!fgets (line, sizeof (line), f)) break; codepoint = strtoul (line, , 16); if (l == line || *l != ';') if (codepoint > MAX_CODE_POINT) do { } while (*l != ';'); { } } if (ferror (f)) fclose (f); } are the common lines close to each other (and whole write_copyright function). Dunno if with that I could use just 2022 copyright or not. > > + /* We don't know what the next letter will be. > > +It could be ISALNUM, then we are supposed > > +to omit it, or it could be a space and then > > +we should not omit it and need to compare it. > > +Fortunately the only 3 names with hyphen > > +followed by non-letter are > > +U+0F0A TIBETAN MARK BKA- SHOG YIG MGO > > +U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN > > +U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN > > +and makeuname2c.cc verifies this. > > +Furthermore, prefixes of NR2 generated > > +ranges all end with a hyphen, but the generated > > +part is then followed by alpha-numeric. > > +So, let's just assume that - at the end of > > +key is always followed by alphanumeric and > > +so should be omitted. */ > > Let's mention that makeuname2c.cc verifies this property. I had "and makeuname2c.cc verifies this." there already a few lines before, but I agree it is better to move that to the end. > > + for (j = start; j < end; j++) > > + { > > + /* Actually strlen, but we know strlen () <= 3. */ > > Is this comment saying that you're using a loop instead of calling strlen > because you know the result will be small? That seems an odd choice. Yes, but perhaps it is a micro-optimization and maybe the Korean characters will not be used that much that it isn't worth it. Our optimizers certainly aren't able to figure out that when strlen is called on an array element with size 4 that calling library function isn't the best idea. The string lengths are 0 in 3%, 1 in 44%, 2 in 47% and 3 in 6% of cases. At least on x86_64 when I just use this_len = strlen (hangul_syllables[j]); it calls the library routine. Changed to this_len = strlen (hangul_syllables[j]); > > + /* Try to do a loose name lookup according to > > +Unicode loose matching rule UAX44-LM2. > > Maybe factor the loose lookup into a separate function? Good idea. > > + bidi::kind kind; > > + if (buffer->cur[-1] == 'N') > > + kind = get_bidi_named (pfile, buffer->cur, ); > > + else > > + kind = get_bidi_ucn (pfile, buffer->cur, > > +buffer->cur[-1] == 'U', ); > > Hmm, I'm surprised that we're doing bidi checking before replacing escape > characters with elements of the translation character set. So now we need > to check it three different ways. It is unfortunate, but I'm afraid it is intentional. Because after replacing the escape characters we lose the distinction between characters written as UTF-8 in the source and the escape sequences. The former need to be treated differently as they are more dangerous than the latter, bidi written as UTF-8 can mislead what the source contains already in (some) text editors or whatever way user looks at the source code, while when written as UCNs (\u, \u{}, \U, \N{}) it can be dangerous only when the program emits it at runtime unpaired. Here is incremental diff and full patch (with the huge uname2c.h generated header