Marc-André Lureau <[email protected]> writes:

> The text console receives bytes that may be UTF-8 encoded (e.g. from
> a guest running a modern distro), but currently treats each byte as a
> raw character index into the VGA/CP437 font, producing garbled output
> for any multi-byte sequence.
>
> Add a proper UTF-8 decoder using Bjoern Hoehrmann's DFA.
> The DFA inherently rejects overlong encodings, surrogates, and
> codepoints above U+10FFFF.  Completed codepoints are then mapped to
> CP437; unmappable characters are displayed as '?'.
>
> Signed-off-by: Marc-André Lureau <[email protected]>

Have you considered the decoder in util/unicode.c?  Do we need two
decoders, or could we replace one with the other?

There's a mad UTF-8 test suite buried in tests/unit/check-qjson.c,
derived from Markus Kuhn's UTF-8 decoder capability and stress test at
<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.  How does
this decoder do on these tests?
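
To make the properties in question concrete, here is a rough sketch of
the kind of checks I have in mind.  It is a hand-rolled range-checking
decoder, not Hoehrmann's table-driven DFA and not the code in
util/unicode.c.  The names sketch_utf8_decode(),
sketch_cp437_from_ucs() and sketch_sample_checks() are made up for
illustration, and the CP437 mapping covers only a few characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU code.  Decode one UTF-8 sequence from
 * s[0..len).  On success store the codepoint in *cp and return the
 * number of bytes consumed; return -1 for any invalid sequence.
 */
int sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c, min;
    int n, i;

    if (len == 0) {
        return -1;
    }
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return -1;              /* stray continuation byte or 0xF8..0xFF */
    }
    if (len < (size_t)n) {
        return -1;              /* truncated sequence */
    }
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return -1;          /* not a continuation byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min) {
        return -1;              /* overlong encoding */
    }
    if (c >= 0xD800 && c <= 0xDFFF) {
        return -1;              /* UTF-16 surrogate */
    }
    if (c > 0x10FFFF) {
        return -1;              /* beyond the Unicode range */
    }
    *cp = c;
    return n;
}

/* Deliberately partial CP437 mapping; anything unknown becomes '?'. */
unsigned char sketch_cp437_from_ucs(uint32_t cp)
{
    if (cp >= 0x20 && cp <= 0x7E) {
        return (unsigned char)cp;   /* printable ASCII is unchanged */
    }
    switch (cp) {
    case 0x00C7: return 0x80;       /* C with cedilla */
    case 0x00FC: return 0x81;       /* u with diaeresis */
    case 0x00E9: return 0x82;       /* e with acute */
    default:     return '?';
    }
}

/* A few of the cases the stress test exercises. */
void sketch_sample_checks(void)
{
    static const unsigned char e_acute[]   = { 0xC3, 0xA9 };
    static const unsigned char overlong[]  = { 0xC0, 0x80 };
    static const unsigned char surrogate[] = { 0xED, 0xA0, 0x80 };
    static const unsigned char too_big[]   = { 0xF4, 0x90, 0x80, 0x80 };
    uint32_t cp = 0;

    assert(sketch_utf8_decode(e_acute, sizeof(e_acute), &cp) == 2);
    assert(cp == 0xE9 && sketch_cp437_from_ucs(cp) == 0x82);
    assert(sketch_utf8_decode(overlong, sizeof(overlong), &cp) == -1);
    assert(sketch_utf8_decode(surrogate, sizeof(surrogate), &cp) == -1);
    assert(sketch_utf8_decode(too_big, sizeof(too_big), &cp) == -1);
}
```

The actual test file covers far more than this (truncated sequences,
stray continuation bytes, boundary values), so treat the above only as
a statement of the expected behaviour, not as a substitute for running
the suite.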

