Package: mksh
Version: 57-6

Mksh is unable to handle Unicode characters outside the BMP when using the \U escape: it turns all of them into U+FFFD, the Unicode replacement character. Consider this interaction:
    $ mksh -c "echo $'\U1F600'" | hexdump -C
    00000000  ef bf bd 0a                                       |....|
    00000004
    $ mksh -c "echo $'\U01F60'" | hexdump -C
    00000000  e1 bd a0 0a                                       |....|
    00000004

The first command should have produced the four-byte UTF-8 encoding of the non-BMP character. Instead it outputs U+FFFD (EF BF BD) plus a newline, indicating some sort of error. The second command shows that the code for parsing '\Uxxxxxxxx' itself is all right, since giving it a BMP value produces a reasonable result.

I checked the mksh source code on this, and it appears the upstream has this issue too. The problem here is in a homebrew utf_wctomb function[0]. This function only handles UTF-8 sequences of up to three bytes, whereas in practice UTF-8 goes up to four bytes (covering everything through U+10FFFF). Moving the current "else" case to a "c < 0x10000" case and adding some four-byte code will help.

[0]: https://github.com/MirBSD/mksh/blob/81a25f2a80f1a1f8e8149635b9f03587cb7cc953/expr.c#L833

The same problem is repeated in utf_mbtowc, although the author does appear aware of the "BMP" concept there.

Regards,
Mingye Wang (Artoria2e5)