On Fri, May 24, 2024 at 3:31 AM Peter Bex <pe...@more-magic.net> wrote:
> - It encodes how many bytes to use in the first byte's leading bit,
>   leading three bits, leading four bits or leading five bits depending
>   on the length.
>
> This latter property is extra annoying because you can't just extract
> the length from the first byte - you have to scan the first bit to
> decide what to do next. Then, you scan the second and third bit etc.

That's not actually true. You can use a table of 256 entries, with one single-byte entry for each possible value of the first byte, specifying the length of the UTF-8 sequence. So table entries 0 to 127 have value 1, etc. Entries that aren't valid UTF-8 leading bytes, such as 255, have 0 in the table.
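
A minimal sketch of such a lookup table in C, in case it helps. The names (utf8_len_table, utf8_len_table_init, utf8_seq_len, utf8_len_selfcheck) are made up for illustration. It uses one entry per possible byte value, so 256 entries, and it assumes the RFC 3629 rules, under which 0xC0, 0xC1 and 0xF5 to 0xFF can never start a sequence; a decoder following the older, longer forms would fill those entries differently.

```c
#include <assert.h>   /* only needed for the self-check at the bottom */
#include <stdint.h>

/* One single-byte entry per possible first byte: the sequence length
 * in bytes, or 0 if the byte cannot start a UTF-8 sequence. */
static uint8_t utf8_len_table[256];

/* Fill the table once at startup (hypothetical helper). */
static void utf8_len_table_init(void)
{
    for (int b = 0; b < 256; b++) {
        uint8_t len;
        if (b <= 0x7F)      len = 1;  /* 0xxxxxxx: ASCII */
        else if (b <= 0xBF) len = 0;  /* 10xxxxxx: continuation byte */
        else if (b <= 0xC1) len = 0;  /* would only encode overlong forms */
        else if (b <= 0xDF) len = 2;  /* 110xxxxx */
        else if (b <= 0xEF) len = 3;  /* 1110xxxx */
        else if (b <= 0xF4) len = 4;  /* 11110xxx, up to U+10FFFF */
        else                len = 0;  /* 0xF5..0xFF: never valid */
        utf8_len_table[b] = len;
    }
}

/* The length lookup is a single indexed load, no bit scanning. */
static unsigned utf8_seq_len(unsigned char first)
{
    return utf8_len_table[first];
}

/* Usage example: a few spot checks on well-known leading bytes. */
static void utf8_len_selfcheck(void)
{
    utf8_len_table_init();
    assert(utf8_seq_len('A')  == 1);  /* U+0041 */
    assert(utf8_seq_len(0xC3) == 2);  /* starts U+00E9 */
    assert(utf8_seq_len(0xE2) == 3);  /* starts U+20AC */
    assert(utf8_seq_len(0xF0) == 4);  /* starts U+1F600 */
    assert(utf8_seq_len(0x80) == 0);  /* continuation byte */
    assert(utf8_seq_len(0xFF) == 0);  /* never a leading byte */
}
```

The table only gives the length implied by the first byte. The continuation bytes still have to be checked separately when validating input.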