On 3/2/23 18:07, Lewis Hyatt wrote:
The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token.

BTW, the other place that has been mis-lexing identifiers is
lex_identifier_intern(), which is used to implement #pragma push_macro
and #pragma pop_macro. This does not support extended characters either.
I will add that in a subsequent patch, because it can't directly reuse the
new function, but rather needs to lex from a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix.

libcpp/ChangeLog:

        PR preprocessor/103902
        * lex.cc (identifier_diagnostics_on_lex): New function refactoring
        some common code.
        (lex_identifier_intern): Use the new function.
        (lex_identifier): Don't run identifier diagnostics here, rather let
        the call site do it when needed.
        (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
        acccordingly.
        (struct scan_id_result): New struct.
        (scan_cur_identifier): New function.
        (create_literal2): New function.
        (lit_accum::create_literal2): New function.
        (is_macro): Folded into new function...
        (maybe_ignore_udl_macro_suffix): ...here.
        (is_macro_not_literal_suffix): Folded likewise.
        (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
        (lex_string): Likewise.

gcc/testsuite/ChangeLog:

        PR preprocessor/103902
        * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
        * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
        * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
        * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
---

Notes:
     Hello-
This is the updated version of the patch, incorporating feedback from Jakub
     and Jason, most recently discussed here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html Please let me know how it looks? It is simpler than before with the new
     approach. Thanks!
One thing to note. As Jason clarified for me, a usage like this: #pragma GCC poison _x
     const char * operator "" _x (const char *, unsigned long);
The space between the "" and the _x is currently allowed but will be
     deprecated in C++23. GCC currently will complain about the poisoned use of
     _x in this case, and this patch, which is just focused on handling UTF-8
     properly, does not change this. But it seems that it would be correct
     not to apply poison in this case. I can try to follow up with a patch to do
     so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
     worth it...

Right, the deprecation is intended to avoid this problem; it's fine for us to complain.

     For the time being, this patch does add a testcase for the above and xfails
     it. For the case where no space is present, which is the part touched by 
the
     present patch, existing behavior is preserved correctly and no diagnostics
     such as poison are issued for the UDL suffix. (Contrary to v1 of this
     patch.)
Thanks! bootstrap + regtested all languages on x86-64 Linux with
     no regressions.

-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
+/* Helper function to check if a string format macro, say from inttypes.h, is
+   placed touching a string literal, in which case it could be parsed as a 
C++11
+   user-defined string literal thus breaking the program.

    User-defined literals
+   outside of namespace std must start with a single underscore, so assume
+   anything of that form really is a UDL suffix.  We don't need to worry about
+   UDLs defined inside namespace std because their names are reserved, so 
cannot
+   be used as macro names in valid programs.

I'd prefer to leave this rationale comment on the _[^_] check rather than hoist it out of the function; OK with that change. Thank you very much for your persistence.

    Return TRUE if the UDL should be
+   ignored for now and preserved for potential macro expansion.  */
static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
+                              const uchar *suffix_begin, cpp_hashnode *node)
  {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
+  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
+      || !cpp_macro_p (node))
      return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
-    {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
-    }
-  hash = HT_HASHFINISH (hash, cur - base);
- cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-                                       base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
+  /* Maybe raise a warning here; caller should arrange not to consume
+     the tokens.  */
+  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
+    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
+                          "invalid suffix on literal; C++11 requires a space "
+                          "between literal and string macro");
+  return true;
  }
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
  {
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);

Reply via email to