Package: mksh
Version: 57-6

Mksh is unable to handle Unicode characters that are not in the BMP
when using the \U escape. It turns all of them into U+FFFD, the
replacement character for Unicode errors. Consider this interaction:

$ mksh -c "echo $'\U1F600'" | hexdump -C
00000000  ef bf bd 0a                                       |....|
00000004
$ mksh -c "echo $'\U01F60'" | hexdump -C
00000000  e1 bd a0 0a                                       |....|
00000004

The first result should have been the four-byte UTF-8 encoding of the
non-BMP character (F0 9F 98 80). Instead, it shows U+FFFD (EF BF BD)
plus a newline, indicating some sort of error. The second result shows
that the code parsing '\Uxxxxxxxx' itself is fine, since giving it a
BMP value produces the correct encoding.

I checked the mksh source code on this, and it appears the upstream
has this issue too. The problem is in the homegrown utf_wctomb
function[0]. This function only handles UTF-8 sequences up to three
bytes, whereas in practice UTF-8 sequences go up to four bytes (for
codepoints beyond U+FFFF). Turning the current "else" case into a
"c < 0x10000" case and adding a four-byte branch after it should fix
the problem.

  [0]: https://github.com/MirBSD/mksh/blob/81a25f2a80f1a1f8e8149635b9f03587cb7cc953/expr.c#L833
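For illustration, here is a minimal sketch of what a four-byte-capable
encoder could look like. This is not mksh's actual code and the name
is mine; it only shows the shape of the fix (the former "else" case
becomes "c < 0x10000", and a new branch handles astral-plane
codepoints):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not mksh's utf_wctomb: encode codepoint wc
 * into buf as UTF-8, return the number of bytes written. */
static size_t sketch_wctomb(unsigned char *buf, unsigned int wc)
{
	if (wc < 0x80) {
		buf[0] = wc;
		return 1;
	}
	if (wc < 0x800) {
		buf[0] = 0xC0 | (wc >> 6);
		buf[1] = 0x80 | (wc & 0x3F);
		return 2;
	}
	if (wc < 0x10000) {
		/* the former "else" case: three bytes cover the BMP */
		buf[0] = 0xE0 | (wc >> 12);
		buf[1] = 0x80 | ((wc >> 6) & 0x3F);
		buf[2] = 0x80 | (wc & 0x3F);
		return 3;
	}
	/* new four-byte case for non-BMP codepoints up to U+10FFFF */
	buf[0] = 0xF0 | (wc >> 18);
	buf[1] = 0x80 | ((wc >> 12) & 0x3F);
	buf[2] = 0x80 | ((wc >> 6) & 0x3F);
	buf[3] = 0x80 | (wc & 0x3F);
	return 4;
}
```

With this shape, U+1F600 encodes to F0 9F 98 80 instead of collapsing
to U+FFFD, and U+1F60 still encodes to E1 BD A0 as shown above.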

The same problem is repeated on the decoding side in utf_mbtowc,
although the author does appear aware of the "BMP" concept there.
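The decoding side needs the mirror-image change: accept a leading byte
of the form 0xF0..0xF4 and consume three continuation bytes. Again a
hypothetical sketch under assumed names, not mksh's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not mksh's utf_mbtowc: decode one UTF-8
 * sequence from s into *wc, return bytes consumed, or 0 on error. */
static size_t sketch_mbtowc(unsigned int *wc, const unsigned char *s)
{
	if (s[0] < 0x80) {
		*wc = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		if ((s[1] & 0xC0) != 0x80)
			return 0;
		*wc = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return *wc < 0x80 ? 0 : 2;	/* reject overlong */
	}
	if ((s[0] & 0xF0) == 0xE0) {
		if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
			return 0;
		*wc = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) |
		    (s[2] & 0x3F);
		return *wc < 0x800 ? 0 : 3;	/* reject overlong */
	}
	if ((s[0] & 0xF8) == 0xF0) {
		/* the four-byte case that is currently missing */
		if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80 ||
		    (s[3] & 0xC0) != 0x80)
			return 0;
		*wc = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) |
		    ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
		/* reject overlong sequences and values above U+10FFFF */
		return (*wc < 0x10000 || *wc > 0x10FFFF) ? 0 : 4;
	}
	return 0;
}
```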

Regards,
Mingye Wang (Artoria2e5)
