On Fri, May 24, 2024 at 3:31 AM Peter Bex <pe...@more-magic.net> wrote:

> - It encodes how many bytes to use in the first byte's leading bit,
>   leading three bits, leading four bits or leading five bits depending
>   on the length.
>
> This latter property is extra annoying because you can't just extract
> the length from the first byte - you have to scan the first bit to
> decide what to do next.  Then, you scan the second and third bit etc.
>

That's not actually true.  You can use a table of 256 entries, one
single-byte entry for each possible value of the first byte, specifying
the length of the UTF-8 sequence.  So table entries 0 to 127 have value 1,
etc.  Entries that aren't valid UTF-8 leading bytes, such as 255, have 0 in
the table.
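
A minimal sketch of such a table, in Python for brevity.  The names
(UTF8_LENGTH, build_length_table, char_lengths) are made up for
illustration.  I'm assuming the RFC 3629 ranges here, so 0xC0, 0xC1 and
0xF5 to 0xFF count as invalid, and continuation bytes (0x80 to 0xBF) get
0 as well since they can't start a sequence.

```python
def build_length_table() -> bytes:
    """Build a 256-entry table: first byte -> UTF-8 sequence length (0 = invalid)."""
    table = bytearray(256)            # every entry starts out as 0 (invalid)
    for b in range(0x00, 0x80):       # 0xxxxxxx: ASCII, one byte
        table[b] = 1
    for b in range(0xC2, 0xE0):       # 110xxxxx: two bytes (0xC0/0xC1 are overlong)
        table[b] = 2
    for b in range(0xE0, 0xF0):       # 1110xxxx: three bytes
        table[b] = 3
    for b in range(0xF0, 0xF5):       # 11110xxx: four bytes, up to U+10FFFF
        table[b] = 4
    return bytes(table)


UTF8_LENGTH = build_length_table()


def char_lengths(data: bytes) -> list[int]:
    """Walk a byte string using one table lookup per character.

    Only the leading byte is checked; continuation bytes are not validated,
    and a sequence truncated at the end of the input is not detected.
    """
    lengths = []
    i = 0
    while i < len(data):
        n = UTF8_LENGTH[data[i]]      # single lookup, no bit scanning
        if n == 0:
            raise ValueError(f"invalid leading byte 0x{data[i]:02x} at offset {i}")
        lengths.append(n)
        i += n
    return lengths
```

With that, char_lengths("h€".encode("utf-8")) does one lookup per
character instead of testing the leading bits one at a time.  A real
decoder would still have to check the continuation bytes.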
