Hi

On Wed, Mar 25, 2026 at 9:35 AM Markus Armbruster <[email protected]> wrote:
>
> Marc-André Lureau <[email protected]> writes:
>
> > The text console receives bytes that may be UTF-8 encoded (e.g. from
> > a guest running a modern distro), but currently treats each byte as a
> > raw character index into the VGA/CP437 font, producing garbled output
> > for any multi-byte sequence.
> >
> > Add a proper UTF-8 decoder using Bjoern Hoehrmann's DFA.
> > The DFA inherently rejects overlong encodings, surrogates, and
> > codepoints above U+10FFFF.  Completed codepoints are then mapped to
> > CP437, unmappable characters are displayed as '?'.
> >
> > Signed-off-by: Marc-André Lureau <[email protected]>
>
> Have you considered the decoder in util/unicode.c?  Do we need two
> decoders, or could we replace one by the other?
>

Oh! I missed it. I should definitely try to use it.

> There's a mad UTF-8 test suite buried in tests/unit/check-qjson.c
> derived from Markus Kuhn's UTF-8 decoder capability and stress test at
> <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.  How does
> this decoder do on these tests?
>

According to the author's original article, it should pass:
https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
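
For reference, here is a rough sketch of the rejection cases in question
(overlong encodings, surrogates, codepoints above U+10FFFF). It is neither
the DFA from the patch nor the decoder in util/unicode.c, just a plain
hand-written strict decoder; utf8_decode_strict() is a hypothetical name
used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: decode one UTF-8 sequence
 * from buf[0..len).  Returns the number of bytes consumed and stores the
 * codepoint in *cp, or returns 0 if the sequence is invalid or truncated.
 */
static size_t utf8_decode_strict(const uint8_t *buf, size_t len, uint32_t *cp)
{
    /* Smallest codepoint that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t c;

    if (len == 0) {
        return 0;
    }
    if (buf[0] < 0x80) {
        *cp = buf[0];           /* plain ASCII */
        return 1;
    } else if ((buf[0] & 0xE0) == 0xC0) {
        need = 2;
        c = buf[0] & 0x1F;
    } else if ((buf[0] & 0xF0) == 0xE0) {
        need = 3;
        c = buf[0] & 0x0F;
    } else if ((buf[0] & 0xF8) == 0xF0) {
        need = 4;
        c = buf[0] & 0x07;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }
    if (len < need) {
        return 0;               /* truncated sequence */
    }
    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            return 0;           /* not a continuation byte */
        }
        c = (c << 6) | (buf[i] & 0x3F);
    }
    if (c < min_cp[need]) {
        return 0;               /* overlong encoding */
    }
    if (c >= 0xD800 && c <= 0xDFFF) {
        return 0;               /* UTF-16 surrogate */
    }
    if (c > 0x10FFFF) {
        return 0;               /* beyond the Unicode range */
    }
    *cp = c;
    return need;
}
```

With this sketch, "\xC3\xA9" decodes to U+00E9, while "\xC0\xAF" (overlong),
"\xED\xA0\x80" (surrogate) and "\xF4\x90\x80\x80" (above U+10FFFF) are all
rejected.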

