https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91843

            Bug ID: 91843
           Summary: pretty printer mangles extended characters
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lhyatt at gmail dot com
  Target Milestone: ---

There seem to be some issues with identifiers containing extended characters
going through pretty-print.c. For example, this test:

=====
int int_π = 3;
int π() {
    return itn_π;
}
=====

(note: if testing older trunk that doesn't have extended identifier support,
the same behavior is seen using UCNs).

In either C or C++ mode, the output is wrong:

t8.cpp: In function ‘int \xcf\x80()’:
t8.cpp:3:12: error: ‘itn_π’ was not declared in this scope; did you mean
‘int_\xcf\x80’?
    3 |     return itn_π;
      |            ^~~~~
      |            int_π

(in C it's the same except for minor changes in the text.)

So extended characters are printed 5 times: 2 of them get mangled with hex
escape codes and 3 come out OK. Of the 3 that work, 2 come from
diagnostic-show-locus.c and are just output directly from the source. The other
one (the error: 'itn_π') is printed using %qD, which ends up in
pp_c_identifier, which ends up calling pp_identifier in pretty-print.h, which
calls pp_string, which does not do any hex escaping.

For the two wrong ones, the code path is different for C and C++, but they end
up in pretty-print.c being processed as a %qs directive either way.
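
To show where the \xcf\x80 comes from, here is a minimal standalone sketch (not
GCC's code; escape_nonprintable is a made-up name) that hex-escapes every byte
outside printable ASCII. It assumes the source file is UTF-8, where π is the
two bytes 0xcf 0x80:

```c
#include <stdio.h>

/* Hypothetical helper, not taken from GCC: copy SRC into DST, replacing
   every byte outside printable ASCII (0x20..0x7e) with a \xNN escape.
   DST must have room for 4 * strlen (SRC) + 1 bytes.  */
static void
escape_nonprintable (const char *src, char *dst)
{
  for (const unsigned char *p = (const unsigned char *) src; *p; p++)
    {
      if (*p >= 0x20 && *p <= 0x7e)
        *dst++ = (char) *p;
      else
        /* Each byte of a multibyte UTF-8 character is escaped separately.  */
        dst += sprintf (dst, "\\x%02x", (unsigned int) *p);
    }
  *dst = '\0';
}
```

Calling it on the UTF-8 bytes of int_π yields the text int_\xcf\x80, since
each of the two bytes of π is escaped on its own.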

(BTW, incidental to this bug report, the "did you mean" part is missing a call
to identifier_to_locale(). But that isn't the reason it gets misprinted.)

It seems there are lots of code paths that may end up printing an identifier
via %qs, so I tried to look at this common element, and the situation seems
straightforward enough, but I don't understand why it is the way it is, so I'm
not sure what's the correct fix. The source of the issue is that the %qs seems
to apply quoting in two unrelated senses... It surrounds the string with
quotation marks, and it also prints the string with pp_quoted_string() instead
of pp_string(). The pp_quoted_string() then applies the hex escape to all
non-printable bytes it comes across. Seems there would be a few options:

-Change %qs to only surround with quotes, not also do hex escapes. This is a
simple one-line fix but I am not sure why it does this hex escaping or if it's
still necessary for other use cases.

-Maybe there is some alternative to %qs that is already there that's supposed
to be used for this, and needs to be added in various places? This test case
reveals two of them, but there must be others among 2000 different uses of %qs,
so not sure about this...

-Change pp_quoted_string() to check if the bytes it would escape form a valid
UTF-8 sequence, in which case, don't escape them. That also seems relatively
simple and would handle all use cases of %qs wherever they may be.
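
As a rough illustration of the last option, here is a standalone sketch (not a
patch against pretty-print.c; utf8_seq_len and escape_unless_utf8 are made-up
names). It assumes well-formed UTF-8 should pass through unchanged and that
everything else non-printable still gets escaped:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return the length (2..4) of the well-formed UTF-8
   multibyte sequence starting at P, or 0 if there is none.  Overlong
   encodings, surrogates and values above U+10FFFF are rejected.  */
static size_t
utf8_seq_len (const unsigned char *p)
{
  size_t len;
  unsigned int cp, min;

  if ((p[0] & 0xe0) == 0xc0)
    { len = 2; cp = p[0] & 0x1f; min = 0x80; }
  else if ((p[0] & 0xf0) == 0xe0)
    { len = 3; cp = p[0] & 0x0f; min = 0x800; }
  else if ((p[0] & 0xf8) == 0xf0)
    { len = 4; cp = p[0] & 0x07; min = 0x10000; }
  else
    return 0;

  for (size_t i = 1; i < len; i++)
    {
      /* A NUL terminator fails this test too, so we never read past it.  */
      if ((p[i] & 0xc0) != 0x80)
        return 0;
      cp = (cp << 6) | (p[i] & 0x3f);
    }

  if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
    return 0;
  return len;
}

/* Hypothetical helper: like a byte-wise hex escaper, but valid UTF-8
   sequences are copied through unchanged.  DST must have room for
   4 * strlen (SRC) + 1 bytes.  */
static void
escape_unless_utf8 (const char *src, char *dst)
{
  const unsigned char *p = (const unsigned char *) src;

  while (*p)
    {
      size_t n;
      if (*p >= 0x20 && *p <= 0x7e)
        *dst++ = (char) *p++;
      else if ((n = utf8_seq_len (p)) != 0)
        while (n--)
          *dst++ = (char) *p++;
      else
        dst += sprintf (dst, "\\x%02x", (unsigned int) *p++);
    }
  *dst = '\0';
}
```

With this, the UTF-8 bytes of int_π come out unchanged, while a stray 0xcf with
no continuation byte is still printed as \xcf. Whether the raw UTF-8 should
then be converted for non-UTF-8 locales is a separate question this sketch
does not address.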


I am happy to work on it but not sure how to decide on the best path. Thanks!

-Lewis
