https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91843
Bug ID: 91843 Summary: pretty printer mangles extended characters Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: lhyatt at gmail dot com Target Milestone: --- There seem to be some issues with identifiers containing extended characters going through pretty-printer.c. For example, this test: ===== int int_π = 3; int π() { return itn_π; } ===== (note: if testing older trunk that doesn't have extended identifier support, the same behavior is seen using UCNs). In either C or C++ mode, the output is wrong: t8.cpp: In function ‘int\xcf\x80()’: t8.cpp:3:12: error: ‘itn_π’ was not declared in this scope; did you mean ‘int\xcf\x80’? 3 | return itn_π; | ^~~~~ | int_π (in C it's the same except for minor changes in the text.) So extended characters are printed 5 times, and 2 of them get mangled with hex escape codes and 3 come out OK. Of the 3 that work, 2 from diagnostic-show-locus.c are just output directly from the source, and the other one (the error: 'itn_π') is printed using %qD, which ends up in pp_c_identifier, which ends up calling pp_identifier in pretty-print.h, which calls pp_string, which does not do any hex escaping. For the two wrong ones, the code path is different for C and C++, but they end up in pretty-printer.c being processed as a %qs directive either way. (BTW, incidental to this bug report, the "did you mean" part is missing a call to identifier_to_locale(). But that isn't the reason it gets misprinted.) It seems there are lots of code paths that may end up printing an identifier via %qs, so I tried to look at this common element, and the situation seems straightforward enough, but I don't understand why it is the way it is, so I'm not sure what's the correct fix. The source of the issue is that the %qs seems to apply quoting in two unrelated senses... It surrounds the string with quotation marks, and it also prints the string with pp_quoted_string() instead of pp_string(). The pp_quoted_string() then applies the hex escape to all non-printable bytes it comes across. Seems there would be a few options: -Change %qs to only surround with quotes, not also do hex escapes. This is a simple one-line fix but I am not sure why it does this hex escaping or if it's still necessary for other use cases. -Maybe there is some alternative to %qs that is already there that's supposed to be used for this, and needs to be added in various places? This test case reveals two of them, but there must be others among 2000 different uses of %qs, so not sure about this... -Change pp_quoted_string() to check if the bytes it would escape form a valid UTF-8 sequence, in which case, don't escape them. That also seems relatively simple and would handle all uses cases of %qs wherever they may be. I am happy to work on it but not sure how to decide on the best path. Thanks! -Lewis