https://bugs.kde.org/show_bug.cgi?id=363269
--- Comment #26 from Axel Kellermann <axel.kellerm...@gmx.de> ---

I had another look at the lexer and I think I have pinned down the problem with UTF-16 files in cmListFileLexer.c. Scanning the files works fine; things go wrong when the scanned content is copied into the token structure in cmListFileLexerSetToken(). That code uses strcpy() and strdup(), which are meant to be used only with zero-terminated character arrays. Since we are handling two-byte characters whose most significant byte can be zero (see e.g. the letter 'h', 0x0068), the text to be copied into token->text can contain zero bytes mid-string. In that case strdup() does not do what it is intended to do: it treats the buffer as a zero-terminated string and duplicates it only up to the first occurrence of '\0'. At the same time the original buffer size is stored in token->length, which leads to out-of-bounds memory accesses later on.

I attached a simple UTF-16 file that reliably triggers the problem for me (363269_repro.txt) and a proposed fix that replaces the string functions with malloc()/memcpy() (363269_proposal.patch). Maybe someone with more experience with the cmake parser/lexer could have a look at it.

Related questions: the patch fixes the crashes/ASAN aborts for me, but does the cmake parser really handle UTF-16 files correctly? Functions like cmListFileLexer_BOM() imply that it can handle all kinds of UTF formats, yet at the same time the whole code seems to assume zero-terminated char arrays. Do we possibly need to update to a newer version of the lexer (which is generated from external sources, right?)?

-- 
You are receiving this mail because:
You are watching all bug changes.