https://bugs.kde.org/show_bug.cgi?id=363269

--- Comment #26 from Axel Kellermann <axel.kellerm...@gmx.de> ---
I had another look at the lexer and I think I pinned down the problem with
UTF-16 files in cmListFileLexer.c. Scanning the files works fine, but copying
the scanned content into the token structure in cmListFileLexerSetToken() is
where things go wrong. The code that copies the content uses the functions
strcpy() and strdup(), which are meant to be used only with zero-terminated
strings. Since UTF-16 uses two-byte code units whose most significant byte can
be zero (e.g. the letter 'h' is 0x0068), the text to be copied into
token->text can contain zero bytes mid-string. In that case strdup() doesn't
do what it's intended to do: it interprets the buffer as a zero-terminated
string and only duplicates it up to the first occurrence of '\0'. At the same
time the full buffer size is stored in token->length, so text and length
disagree. This leads to out-of-bounds memory accesses later on.
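To illustrate the failure mode, here is a minimal standalone sketch (the
buffer is hand-written UTF-16LE; the names are illustrative, not the actual
lexer code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* UTF-16LE encoding of "hi": 'h' = 0x68 0x00, 'i' = 0x69 0x00 */
        const char buf[] = { 0x68, 0x00, 0x69, 0x00 };
        size_t length = sizeof(buf);  /* 4 bytes, as the scanner reports */

        char* text = strdup(buf);     /* stops at the first '\0' */
        printf("length = %zu, strlen(text) = %zu\n", length, strlen(text));
        /* Prints "length = 4, strlen(text) = 1": the stored length now
           claims more bytes than the duplicated buffer holds, so any
           consumer that trusts it reads out of bounds. */
        free(text);
        return 0;
    }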

I attached a simple UTF-16 file that reliably triggers the problem for me
(363269_repro.txt) and a proposed fix that replaces the string functions with
malloc()/memcpy() calls (363269_proposal.patch). Maybe someone with more
experience with the CMake parser/lexer could have a look at it.
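For reference, the gist of the proposal is roughly the following (a paraphrase
of the idea, not the literal diff; the real patch modifies
cmListFileLexerSetToken() in cmListFileLexer.c and keeps its existing
buffer-handling details):

    static void cmListFileLexerSetToken(cmListFileLexer* lexer,
                                        const char* text, int length)
    {
      free(lexer->token.text);
      lexer->token.text = NULL;
      lexer->token.length = length;
      if (text) {
        /* memcpy copies all 'length' bytes, embedded '\0' included,
           where strcpy()/strdup() would stop at the first zero byte. */
        lexer->token.text = (char*)malloc(length + 1);
        memcpy(lexer->token.text, text, length);
        /* Keep a terminator for callers that still treat the buffer
           as a C string. */
        lexer->token.text[length] = '\0';
      }
    }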

Related questions: the patch fixes the crashes/ASan aborts for me, but does
the CMake parser really handle UTF-16 files correctly? Functions like
cmListFileLexer_BOM() imply that it can handle all kinds of UTF formats, but
the rest of the code seems to assume it only works on zero-terminated char
arrays. Do we possibly need to update to a newer version of the lexer (which
is generated from external sources, right?)?
