Hi

On Wed, Mar 25, 2026 at 9:35 AM Markus Armbruster <[email protected]> wrote:
>
> Marc-André Lureau <[email protected]> writes:
>
> > The text console receives bytes that may be UTF-8 encoded (e.g. from
> > a guest running a modern distro), but currently treats each byte as a
> > raw character index into the VGA/CP437 font, producing garbled output
> > for any multi-byte sequence.
> >
> > Add a proper UTF-8 decoder using Bjoern Hoehrmann's DFA.
> > The DFA inherently rejects overlong encodings, surrogates, and
> > codepoints above U+10FFFF. Completed codepoints are then mapped to
> > CP437, unmappable characters are displayed as '?'.
> >
> > Signed-off-by: Marc-André Lureau <[email protected]>
>
> Have you considered the decoder in util/unicode.c?  Do we need two
> decoders, or could we replace one by the other?
>

Oh! I missed it. I should definitely try to use it.

> There's a mad UTF-8 test suite buried in tests/unit/check-qjson.c
> derived from Markus Kuhn's UTF-8 decoder capability and stress test at
> <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.  How does
> this decoder do on these tests?
>

According to the author's original article, it should pass:
https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
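
For reference, the classes of malformed input named in the commit message
(overlong encodings, surrogates, codepoints above U+10FFFF) can be
spot-checked with something like the sketch below. It is neither
Hoehrmann's DFA nor the decoder in util/unicode.c: utf8_decode_strict() is
a hypothetical, hand-rolled helper written only to illustrate which byte
sequences a strict decoder is expected to reject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption, not an existing QEMU function).
 * Decode one UTF-8 sequence from s[0..len).  Returns the number of bytes
 * consumed and stores the codepoint in *cp, or returns 0 if the sequence
 * is malformed.
 */
static size_t utf8_decode_strict(const unsigned char *s, size_t len,
                                 uint32_t *cp)
{
    /* Smallest codepoint that legitimately needs n bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (len == 0) {
        return 0;
    }
    if (s[0] < 0x80) {
        *cp = s[0];             /* plain ASCII */
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4;
        c = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation byte or 0xF8..0xFF */
    }
    if (len < n) {
        return 0;               /* truncated sequence */
    }
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;           /* expected a continuation byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[n]) {
        return 0;               /* overlong encoding */
    }
    if (c >= 0xD800 && c <= 0xDFFF) {
        return 0;               /* UTF-16 surrogate */
    }
    if (c > 0x10FFFF) {
        return 0;               /* beyond the Unicode range */
    }
    *cp = c;
    return n;
}
```

With this sketch, "\xC3\xA9" decodes to U+00E9, while "\xC0\xAF" (overlong
'/'), "\xED\xA0\x80" (U+D800) and "\xF4\x90\x80\x80" (U+110000) are all
rejected. Those are the same categories the stress test exercises.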
