https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105959

David Malcolm <dmalcolm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2023-01-30 00:00:00         |2023-03-16
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #7 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
Aha! - thanks for the information.

I think GCC is writing out the .sarif file in UTF-8 form regardless of the
environment on everyone's box.  The issue seems to be this line in the testcase
to check for the UTF-8 in the "snippet" output:
       { dg-final { scan-sarif-file "\"text\": \"  int \\u6587\\u5b57\\u5316\\u3051 = " } }
that's failing somewhere within DejaGnu, presumably due to the environment
differences.

There is some variation due to json::object using a hash_map for the key/value
pairs, which means (annoyingly) it outputs things in arbitrary order, leading
to non-determinism in the .sarif content.
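
As a rough illustration of why the key order matters for textual comparison
(a Python sketch, not GCC or testsuite code; the two JSON strings are made-up
stand-ins for .sarif output), parsing and re-serializing with sorted keys
removes the ordering difference:

```python
import json

# Two serializations of the same hypothetical SARIF fragment that differ
# only in key order, as an unordered map might emit them.
run_a = '{"startLine": 7, "snippet": {"text": "  int x = *42;\\n"}}'
run_b = '{"snippet": {"text": "  int x = *42;\\n"}, "startLine": 7}'

def canonical(text):
    """Re-serialize JSON with sorted keys so key order no longer matters."""
    return json.dumps(json.loads(text), sort_keys=True, ensure_ascii=False)

print(run_a == run_b)                        # raw text comparison
print(canonical(run_a) == canonical(run_b))  # comparison after canonicalizing
```

This only shows that the raw text of two runs can differ while the content
is the same; it says nothing about how the testsuite should handle it.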

Perhaps it's possible to express byte-level matching in Tcl?  I'll have a look.
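
To make "byte-level matching" concrete, here is a minimal sketch in Python
rather than Tcl (so it is not something DejaGnu could use directly; the
function name and the sample data are hypothetical). It searches for the
expected UTF-8 byte sequence without decoding the file at all, so the result
cannot depend on the locale:

```python
# Expected snippet prefix as raw bytes: the UTF-8 encoding of
# U+6587 U+5B57 U+5316 U+3051, surrounded by the ASCII text of the directive.
EXPECTED = b'"text": "  int \xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91 = '

def contains_bytes(data, needle=EXPECTED):
    """Return True if needle occurs in data; no decoding step is involved."""
    return needle in data

# Stand-in for the .sarif contents; a real check would use
# open(path, "rb").read() instead.
sample = '{"snippet": {"text": "  int 文字化け = *42;\\n"}}'.encode("utf-8")
print(contains_bytes(sample))
```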


Details
=======

The source code (gcc/testsuite/c-c++-common/diagnostic-format-sarif-file-4.c)
is indeed UTF-8 encoded; looking at the output of
./contrib/unicode/utf8-dump.py, I see this for line 7:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
   7 |   int 文字化け = *42;
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+6587  0xe6 0x96 0x87               CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57  0xe5 0xad 0x97               CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316  0xe5 0x8c 0x96               CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051  0xe3 0x81 0x91                       HIRAGANA LETTER KE け
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+0034            0x34                               DIGIT FOUR 4
     |   U+0032            0x32                                DIGIT TWO 2
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
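
For anyone without the GCC tree to hand, a per-character dump in the same
spirit can be sketched in a few lines of Python. This is not the
contrib/unicode/utf8-dump.py script and does not reproduce its exact layout;
dump_line is a hypothetical helper that only shows the code point, its UTF-8
bytes, and its Unicode name:

```python
import unicodedata

def dump_line(line):
    """Return one row per character: code point, UTF-8 bytes, Unicode name."""
    rows = []
    for ch in line:
        encoded = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
        # Control characters such as LF have no name in unicodedata.
        name = unicodedata.name(ch, "(unnamed)")
        rows.append(f"U+{ord(ch):04X}  {encoded}  {name}")
    return rows

for row in dump_line("int 文字化け"):
    print(row)
```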

Looking at the output on my box via:
  hexdump -C testsuite/gcc/diagnostic-format-sarif-file-4.c.sarif|less
and looking for "snippet" shows:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005a0  3a 20 7b 22 63 6f 6e 74  65 78 74 52 65 67 69 6f  |: {"contextRegio|
000005b0  6e 22 3a 20 7b 22 73 74  61 72 74 4c 69 6e 65 22  |n": {"startLine"|
000005c0  3a 20 37 2c 20 22 73 6e  69 70 70 65 74 22 3a 20  |: 7, "snippet": |
000005d0  7b 22 74 65 78 74 22 3a  20 22 20 20 69 6e 74 20  |{"text": "  int |
000005e0  e6 96 87 e5 ad 97 e5 8c  96 e3 81 91 20 3d 20 2a  |............ = *|
000005f0  34 32 3b 5c 6e 22 7d 7d  2c 20 22 61 72 74 69 66  |42;\n"}}, "artif|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

where it's been encoded in UTF-8 as:
   e6 96 87 e5 ad 97 e5 8c  96 e3 81 91 20 3d
which I can confirm with ./contrib/unicode/utf8-dump.py; it shows that the
snippet has been written in UTF-8 form:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+6587  0xe6 0x96 0x87               CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57  0xe5 0xad 0x97               CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316  0xe5 0x8c 0x96               CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051  0xe3 0x81 0x91                       HIRAGANA LETTER KE け
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
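
The same check can be done independently of the script. As a small Python
sketch, decoding the bytes from the hexdump as UTF-8 gives back the expected
code points:

```python
# Bytes copied from the hexdump of the snippet.
raw = bytes.fromhex("e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d")

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

print([f"U+{ord(c):04X}" for c in text])
```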

The test case has:
  { dg-final { scan-sarif-file "\"text\": \"  int \\u6587\\u5b57\\u5316\\u3051 = " } }
which is looking for the text of the snippet containing the Unicode characters.

Attachment 54658 (with md5sum 67cc5fdbee9006509aa38af635d6cf69) has this for
the snippet:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005f0  73 6e 69 70 70 65 74 22  3a 20 7b 22 74 65 78 74  |snippet": {"text|
00000600  22 3a 20 22 20 20 69 6e  74 20 e6 96 87 e5 ad 97  |": "  int ......|
00000610  e5 8c 96 e3 81 91 20 3d  20 2a 34 32 3b 5c 6e 22  |...... = *42;\n"|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      which is:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+6587  0xe6 0x96 0x87               CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57  0xe5 0xad 0x97               CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316  0xe5 0x8c 0x96               CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051  0xe3 0x81 0x91                       HIRAGANA LETTER KE け
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+0034            0x34                               DIGIT FOUR 4
     |   U+0032            0x32                                DIGIT TWO 2
     |   U+003B            0x3b                                SEMICOLON ;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hence GCC is also writing out the .sarif file in UTF-8 form in that attachment,
regardless of the environment; the issue is presumably within the handling of
this directive:
       { dg-final { scan-sarif-file "\"text\": \"  int \\u6587\\u5b57\\u5316\\u3051 = " } }
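
One way the environment could matter is sketched below in Python. This is
only an illustration of a suspected mechanism, not a confirmed diagnosis: it
assumes the pattern ends up as a regexp over code points, and that the file
contents are decoded according to the locale before matching. Under those
assumptions the same bytes match when read as UTF-8 and fail to match when
read with a single-byte encoding (latin-1 is used here purely as an example):

```python
import re

# Assumption: the directive's escapes reach the regexp engine as \uXXXX
# escapes, i.e. a pattern expressed in code points rather than bytes.
PATTERN = re.compile(r'"text": "  int \u6587\u5b57\u5316\u3051 = ')

# Bytes as they appear in the .sarif hexdumps.
data = b'"text": "  int \xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91 = *42;\\n"'

as_utf8 = data.decode("utf-8")      # reader assumes UTF-8
as_latin1 = data.decode("latin-1")  # reader assumes a single-byte encoding

print(bool(PATTERN.search(as_utf8)))    # matches
print(bool(PATTERN.search(as_latin1)))  # does not match
```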
