https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105959
David Malcolm <dmalcolm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2023-01-30 00:00:00         |2023-03-16
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #7 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
Aha! - thanks for the information.

I think GCC is writing out the .sarif file in UTF-8 form regardless of the
environment on everyone's box. The issue seems to be this line in the
testcase, which checks for the UTF-8 in the "snippet" output:

  { dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }

That's failing somewhere within DejaGnu, presumably due to the environment
differences.

There is some variation due to json::object using a hash_map for the
key/value pairs, which means (annoyingly) that it outputs things in arbitrary
order, leading to non-determinism in the .sarif content.

Perhaps it's possible to express byte-level matching in Tcl? I'll have a
look.

Details
=======

The source code (gcc/testsuite/c-c++-common/diagnostic-format-sarif-file-4.c)
is indeed UTF-8 encoded; looking at the output of
./contrib/unicode/utf8-dump.py, I see this for line 7:

VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
   7 |   int 文字化け = *42;
     |   U+0020   0x20             SPACE (separator)
     |   U+0020   0x20             SPACE (separator)
     |   U+0069   0x69             LATIN SMALL LETTER I i
     |   U+006E   0x6e             LATIN SMALL LETTER N n
     |   U+0074   0x74             LATIN SMALL LETTER T t
     |   U+0020   0x20             SPACE (separator)
     |   U+6587   0xe6 0x96 0x87   CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57   0xe5 0xad 0x97   CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316   0xe5 0x8c 0x96   CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051   0xe3 0x81 0x91   HIRAGANA LETTER KE け
     |   U+0020   0x20             SPACE (separator)
     |   U+003D   0x3d             EQUALS SIGN =
     |   U+0020   0x20             SPACE (separator)
     |   U+002A   0x2a             ASTERISK *
     |   U+0034   0x34             DIGIT FOUR 4
     |   U+0032   0x32             DIGIT TWO 2
     |   U+003B   0x3b             SEMICOLON ;
     |   U+000A   0x0a             LINE FEED (LF) (control character)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Looking at the output on my box via:

  hexdump -C testsuite/gcc/diagnostic-format-sarif-file-4.c.sarif|less

and looking for "snippet" shows:

VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005a0  3a 20 7b 22 63 6f 6e 74  65 78 74 52 65 67 69 6f  |: {"contextRegio|
000005b0  6e 22 3a 20 7b 22 73 74  61 72 74 4c 69 6e 65 22  |n": {"startLine"|
000005c0  3a 20 37 2c 20 22 73 6e  69 70 70 65 74 22 3a 20  |: 7, "snippet": |
000005d0  7b 22 74 65 78 74 22 3a  20 22 20 20 69 6e 74 20  |{"text": "  int |
000005e0  e6 96 87 e5 ad 97 e5 8c  96 e3 81 91 20 3d 20 2a  |............ = *|
000005f0  34 32 3b 5c 6e 22 7d 7d  2c 20 22 61 72 74 69 66  |42;\n"}}, "artif|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

where it's been encoded in UTF-8 as:

  e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d

which I can confirm with ./contrib/unicode/utf8-dump.py, which shows that the
snippet has been written in UTF-8 form:

VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
     |   U+0069   0x69             LATIN SMALL LETTER I i
     |   U+006E   0x6e             LATIN SMALL LETTER N n
     |   U+0074   0x74             LATIN SMALL LETTER T t
     |   U+0020   0x20             SPACE (separator)
     |   U+6587   0xe6 0x96 0x87   CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57   0xe5 0xad 0x97   CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316   0xe5 0x8c 0x96   CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051   0xe3 0x81 0x91   HIRAGANA LETTER KE け
     |   U+0020   0x20             SPACE (separator)
     |   U+003D   0x3d             EQUALS SIGN =
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The test case has:

  { dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }

which is looking for the text of the snippet containing the Unicode
characters.

Attachment 54658 (with md5sum 67cc5fdbee9006509aa38af635d6cf69) has this for
the snippet:

VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005f0  73 6e 69 70 70 65 74 22  3a 20 7b 22 74 65 78 74  |snippet": {"text|
00000600  22 3a 20 22 20 20 69 6e  74 20 e6 96 87 e5 ad 97  |": "  int ......|
00000610  e5 8c 96 e3 81 91 20 3d  20 2a 34 32 3b 5c 6e 22  |...... = *42;\n"|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

which is:

VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
     |   U+0069   0x69             LATIN SMALL LETTER I i
     |   U+006E   0x6e             LATIN SMALL LETTER N n
     |   U+0074   0x74             LATIN SMALL LETTER T t
     |   U+0020   0x20             SPACE (separator)
     |   U+6587   0xe6 0x96 0x87   CJK UNIFIED IDEOGRAPH-6587 文
     |   U+5B57   0xe5 0xad 0x97   CJK UNIFIED IDEOGRAPH-5B57 字
     |   U+5316   0xe5 0x8c 0x96   CJK UNIFIED IDEOGRAPH-5316 化
     |   U+3051   0xe3 0x81 0x91   HIRAGANA LETTER KE け
     |   U+0020   0x20             SPACE (separator)
     |   U+003D   0x3d             EQUALS SIGN =
     |   U+0020   0x20             SPACE (separator)
     |   U+002A   0x2a             ASTERISK *
     |   U+0034   0x34             DIGIT FOUR 4
     |   U+0032   0x32             DIGIT TWO 2
     |   U+003B   0x3b             SEMICOLON ;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hence GCC is also writing out the .sarif file in UTF-8 form in that
attachment, regardless of the environment; the issue is presumably within the
handling of this directive:

  { dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }
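
For anyone wanting to double-check the bytes without utf8-dump.py, here is a
minimal Python sketch of the byte-level comparison. It is only an
illustration, not the Tcl/DejaGnu fix and not part of the GCC testsuite. The
names SNIPPET_BYTES, non_ascii_code_points and contains_utf8 are hypothetical,
and the latin-1 decode at the end merely shows one way a character-level match
could fail when the bytes are read with a non-UTF-8 encoding; it is an
assumption, not a diagnosis of what DejaGnu actually does.

```python
# Bytes of the snippet text, copied from the hexdumps above.
SNIPPET_BYTES = bytes.fromhex(
    "20 20 69 6e 74 20 "     # two spaces, "int", space
    "e6 96 87 e5 ad 97 "     # U+6587 U+5B57
    "e5 8c 96 e3 81 91 "     # U+5316 U+3051
    "20 3d 20 2a 34 32 3b"   # " = *42;"
)


def non_ascii_code_points(data):
    """Decode data as UTF-8 and list the non-ASCII code points as U+XXXX."""
    text = data.decode("utf-8")
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 0x7F]


def contains_utf8(haystack, needle):
    """Byte-level match: encode needle as UTF-8 and search the raw bytes."""
    return needle.encode("utf-8") in haystack


if __name__ == "__main__":
    # The same four code points that utf8-dump.py reports.
    print(non_ascii_code_points(SNIPPET_BYTES))

    # Matching on bytes does not depend on the locale of the reader.
    print(contains_utf8(SNIPPET_BYTES, "\u6587\u5b57\u5316\u3051 = "))

    # Assumption for illustration: if the bytes were decoded as latin-1
    # instead of UTF-8, a character-level search for U+6587 would miss.
    print("\u6587" in SNIPPET_BYTES.decode("latin-1"))
```

The point of the byte-level helper is that it compares the encoded form
directly, so the result is the same whatever encoding the surrounding
environment defaults to.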