Marc-André Lureau <[email protected]> writes:

> The text console receives bytes that may be UTF-8 encoded (e.g. from
> a guest running a modern distro), but currently treats each byte as a
> raw character index into the VGA/CP437 font, producing garbled output
> for any multi-byte sequence.
>
> Add a proper UTF-8 decoder using Bjoern Hoehrmann's DFA.
> The DFA inherently rejects overlong encodings, surrogates, and
> codepoints above U+10FFFF. Completed codepoints are then mapped to
> CP437, unmappable characters are displayed as '?'.
>
> Signed-off-by: Marc-André Lureau <[email protected]>

Have you considered the decoder in util/unicode.c?  Do we need two
decoders, or could we replace one by the other?

There's a mad UTF-8 test suite buried in tests/unit/check-qjson.c
derived from Markus Kuhn's UTF-8 decoder capability and stress test at
<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.  How does
this decoder do on these tests?
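
To make the question concrete, here is a rough, self-contained sketch of
the properties such tests probe: overlong encodings, surrogates, and
code points above U+10FFFF must all be rejected, and whatever cannot be
mapped to CP437 becomes '?'.  This is neither Hoehrmann's DFA nor the
code in util/unicode.c; it is a plain hand-written strict decoder, and
every name in it (Utf8State, utf8_feed, utf8_decode_seq, to_cp437) is
made up for illustration.  Unlike the DFA, it only notices overlong
forms and surrogates once the last byte of the sequence has arrived.
The CP437 table is deliberately reduced to two entries.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_INCOMPLETE 0xFFFFFFFEu  /* sequence needs more bytes */
#define UTF8_INVALID    0xFFFFFFFFu  /* malformed or forbidden sequence */

/* Hypothetical incremental decoder state; zero-initialize before use. */
typedef struct {
    uint32_t cp;   /* code point bits accumulated so far */
    uint32_t min;  /* smallest code point allowed for this length */
    int need;      /* continuation bytes still expected */
} Utf8State;

/* Feed one byte; returns a code point, UTF8_INCOMPLETE or UTF8_INVALID. */
static uint32_t utf8_feed(Utf8State *s, uint8_t b)
{
    if (s->need == 0) {
        if (b < 0x80) {
            return b;
        } else if (b >= 0xC2 && b <= 0xDF) {
            s->cp = b & 0x1F; s->need = 1; s->min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            s->cp = b & 0x0F; s->need = 2; s->min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            s->cp = b & 0x07; s->need = 3; s->min = 0x10000;
        } else {
            /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
            return UTF8_INVALID;
        }
        return UTF8_INCOMPLETE;
    }
    if ((b & 0xC0) != 0x80) {
        /* sequence cut short; this sketch simply drops the byte */
        s->need = 0;
        return UTF8_INVALID;
    }
    s->cp = (s->cp << 6) | (b & 0x3F);
    if (--s->need) {
        return UTF8_INCOMPLETE;
    }
    if (s->cp < s->min || s->cp > 0x10FFFF ||
        (s->cp >= 0xD800 && s->cp <= 0xDFFF)) {
        return UTF8_INVALID;  /* overlong, out of range, or surrogate */
    }
    return s->cp;
}

/* Decode a buffer that must hold exactly one well-formed sequence. */
static uint32_t utf8_decode_seq(const char *buf, size_t len)
{
    Utf8State s = { 0, 0, 0 };
    uint32_t r = UTF8_INCOMPLETE;

    for (size_t i = 0; i < len; i++) {
        r = utf8_feed(&s, (uint8_t)buf[i]);
        if (r == UTF8_INVALID) {
            return UTF8_INVALID;
        }
        if (r != UTF8_INCOMPLETE && i + 1 != len) {
            return UTF8_INVALID;  /* trailing bytes after a code point */
        }
    }
    return r;
}

/* Tiny excerpt of a Unicode-to-CP437 mapping; anything else is '?'. */
static uint8_t to_cp437(uint32_t cp)
{
    if (cp < 0x80) {
        return (uint8_t)cp;   /* ASCII maps to itself */
    }
    switch (cp) {
    case 0x00E9: return 0x82; /* LATIN SMALL LETTER E WITH ACUTE */
    case 0x2588: return 0xDB; /* FULL BLOCK */
    default:     return '?';
    }
}
```

A real comparison would of course run the actual decoder through the
cases in check-qjson.c rather than a handful of hand-picked inputs.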
