https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118686
Bug ID: 118686
Summary: Poor error message for ill-formed UTF-8 sequence
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: sarif-replay
Assignee: dmalcolm at gcc dot gnu.org
Reporter: dmalcolm at gcc dot gnu.org
Target Milestone: ---
Created attachment 60308
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60308&action=edit
Malformed generated .sarif that sarif-replay doesn't handle well
I'm attaching a generated .sarif file which somehow has malformed UTF-8.
Python reports the byte offset of the problem:
/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 517696: invalid start byte
but sarif-replay merely reports line 1, column 1:
$ LD_LIBRARY_PATH=. ./sarif-replay /home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif
/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif:1:1: error: ill-formed UTF-8 sequence
    1 | {"$schema": "https://docs.oasis-open.org/sarif/sarif/v2.1.0/errata01/os/schemas/sarif-schema-2.1.0.json",
      | ^
Ideally sarif-replay should show the precise line of the malformed data, and
use the escaping logic to show the offending bytes in the annotations to the
quoted source, as per -fdiagnostics-escape-format=bytes
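
As a rough illustration of the desired behavior (a Python sketch, not sarif-replay's actual code), the byte offset of the first ill-formed sequence can be turned into a line and column, and the offending line can be quoted with non-ASCII bytes escaped. The function names locate_invalid_utf8 and escape_line are hypothetical, columns are counted in bytes, and the <XX> rendering only approximates what -fdiagnostics-escape-format=bytes prints.

```python
def locate_invalid_utf8(data: bytes):
    """Return (line, column, offset, bad_byte) for the first ill-formed
    UTF-8 sequence in data, or None if data decodes cleanly.

    line and column are 1-based; column counts bytes within the line.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start  # byte offset of the start of the bad sequence
        line = data.count(b"\n", 0, offset) + 1
        line_start = data.rfind(b"\n", 0, offset) + 1  # 0 if no newline found
        column = offset - line_start + 1
        return line, column, offset, data[offset]
    return None


def escape_line(data: bytes, offset: int) -> str:
    """Quote the line containing offset, showing every byte outside
    printable ASCII as <xx> (hex). This is an approximation of the
    'bytes' escape format, not GCC's implementation."""
    start = data.rfind(b"\n", 0, offset) + 1
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"<{b:02x}>"
        for b in data[start:end]
    )


if __name__ == "__main__":
    # Small stand-in for the attached .sarif file: 0x94 is not a valid
    # UTF-8 start byte, matching the error Python reported above.
    sample = b'{"a": 1,\n"b": "x\x94y"}\n'
    found = locate_invalid_utf8(sample)
    if found is not None:
        line, column, offset, bad = found
        print(f"{line}:{column}: ill-formed UTF-8 sequence (byte 0x{bad:02x})")
        print(escape_line(sample, offset))
```

Applied to the attachment, the same approach would point at the line containing byte offset 517696 rather than at line 1, column 1.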