On 8/29/22 17:35, Jakub Jelinek wrote:
On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
On 8/29/22 04:15, Jakub Jelinek wrote:
Hi!

The following patch introduces a new warning, -Winvalid-utf8 (similar to
what Clang now has), to diagnose invalid UTF-8 byte sequences in
comments.  Invalid sequences in identifiers and string literals should
already be diagnosed, but comment content has not really been verified.

I'm not sure if this is enough to say P2295R6 is implemented or not.

The problem is that in the most common case people don't use the
-finput-charset= option, and the sources are often UTF-8, but sometimes
they are in some ASCII-compatible single-byte encoding where non-ASCII
characters appear only in comments.  So having the warning off by default
is IMO desirable.  Now, if people use an explicit -finput-charset=UTF-8,
perhaps we could make the warning on by default for C++23 and use pedwarn
instead of warning, because then the user has told us explicitly that the
source is UTF-8.  From the paper I understood that one of the
implementation options is to claim that the implementation supports two
encodings, UTF-8 and a UTF-8-like encoding where invalid UTF-8 sequences
in comments are replaced by, say, spaces, where the latter could be the
default and the former only used if the -finput-charset=UTF-8
-Werror=invalid-utf8 options are given.

Thoughts on this?

That sounds good to me.

The pedwarn on -std=c++23 -finput-charset=UTF-8, or just documenting that
"conforming" UTF-8 mode is only -finput-charset=UTF-8 -Werror=invalid-utf8?

The former.

+static const uchar *
+_cpp_warn_invalid_utf8 (cpp_reader *pfile)
+{
+  cpp_buffer *buffer = pfile->buffer;
+  const uchar *cur = buffer->cur;
+
+  if (cur[0] < utf8_signifier
+      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+                            pfile->line_table->highest_line,
+                            CPP_BUF_COL (buffer),
+                            "invalid UTF-8 character <%x> in comment",
+                            cur[0]);
+      return cur + 1;
+    }
+  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)

Unicode Table 3-7 says that the second byte is sometimes restricted to a
narrower range than this.

That is true, and I've tried to include tests for all of those cases in
the testcase; all of them get a warning.  Some of them are caught through:
   /* Make sure the shortest possible encoding was used.  */

   if (c <=      0x7F && nbytes > 1) return EILSEQ;
   if (c <=     0x7FF && nbytes > 2) return EILSEQ;
   if (c <=    0xFFFF && nbytes > 3) return EILSEQ;
   if (c <=  0x1FFFFF && nbytes > 4) return EILSEQ;
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
and others are caught through:
   /* Make sure the character is valid.  */
   if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
All I had to do outside of what one_utf8_to_cppchar already implements was:

+             if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
+                 && s <= 0x0010FFFF)

the <= 0x0010FFFF check, because one_utf8_to_cppchar as written happily
supports UTF-8 sequences up to 6 bytes long, which can encode values up
to 0x7FFFFFFF:
    00000000-0000007F   0xxxxxxx
    00000080-000007FF   110xxxxx 10xxxxxx
    00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
    00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
while Table 3-7 only covers encoding 0..D7FF and E000..10FFFF in up to 4
bytes (surrogates are excluded).

I guess I should try what happens with 0x110000 and 0x7fffffff in
identifiers and string literals.

        Jakub
