On 18 May 2017, at 01:04, Philippe Verdy via Unicode <unicode@unicode.org> 
wrote:
> 
> I find it intriguing that the update intends to enforce the decoding of the 
> **shortest** sequences, but now wants to treat **maximal sequences** as a 
> single unit with arbitrary length. UTF-8 was designed to work only with 
> state machines that would NEVER need to parse more than 4 bytes.

This won’t change.  You still don’t need to parse more than four bytes.  In 
fact, you don’t need to do *anything*, even if your implementation doesn’t 
match the proposal, because *it’s only a recommendation*.  But if you did 
choose to do something, you *still* don’t need to scan arbitrary numbers of 
bytes.

> For me, as soon as the first invalid byte is encountered, the current 
> sequence should be stopped there and treated as an error (replaced by U+FFFD 
> if replacement is enabled, instead of returning an error or throwing an 
> exception),

This is still essentially true under the proposal.  The only difference is that 
instead of being a clever dick and using the valid *code point* ranges to ban 
certain trailing bytes based on the value of their predecessor, you accept any 
trailing byte in the 0x80-0xbf range, and only worry about whether the complete 
sequence is over-long or represents a valid code point once you’ve finished 
reading it.  You never need to read more than four bytes under the new 
proposal, because the lead byte tells you how many to expect, and you’d still 
stop and immediately substitute U+FFFD if you see a byte outside the 0x80-0xbf 
range, even if you haven’t yet read as many bytes as the lead byte says to 
expect.
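
To make that concrete, here’s a rough sketch in C of the decoding step I have 
in mind; the function name and the exact checks are mine, not anything the 
proposal actually spells out:

    /* Decode one unit starting at buf[0] (caller guarantees len > 0).
       The lead byte alone fixes the expected length (never more than
       four), any byte in 0x80-0xbf is accepted as a trailing byte, and
       over-long forms, surrogates and out-of-range values are only
       rejected once the whole sequence has been read.  Writes the code
       point (or U+FFFD) to *out and returns the bytes consumed (1-4). */
    #include <stddef.h>
    #include <stdint.h>

    #define REPLACEMENT 0xFFFDu

    size_t utf8_decode_unit(const uint8_t *buf, size_t len, uint32_t *out)
    {
        uint8_t lead = buf[0];
        size_t need, i;
        uint32_t cp;

        if (lead < 0x80)                { *out = lead; return 1; }  /* ASCII        */
        else if ((lead & 0xE0) == 0xC0) { need = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { need = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { need = 4; cp = lead & 0x07; }
        else { *out = REPLACEMENT; return 1; }                      /* invalid lead */

        for (i = 1; i < need; i++) {
            if (i >= len || (buf[i] & 0xC0) != 0x80) {
                /* Truncated: stop here and replace what we have so far. */
                *out = REPLACEMENT;
                return i;
            }
            cp = (cp << 6) | (buf[i] & 0x3F);
        }

        /* Whole sequence read: only now check for over-long forms,
           surrogates and values above U+10FFFF, and replace the entire
           unit with a single U+FFFD if any of those apply.             */
        if (cp < 0x80 || (need == 3 && cp < 0x800) || (need == 4 && cp < 0x10000)
            || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            *out = REPLACEMENT;
        } else {
            *out = cp;
        }
        return need;
    }

Fed <F4 90 80 80>, this reads all four bytes, finds the value is above U+10FFFF 
and emits a single U+FFFD for the whole thing; fed <E1 80 41>, it stops after 
two bytes, emits U+FFFD, and then decodes the 0x41 as ‘A’ on the next call.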

This also *does not* change the view of the underlying UTF-8 string based on 
iteration direction; you would still generate the exact same sequence of code 
points in both directions.
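
Here’s the corresponding backwards step, again just a sketch of mine built on 
the function above, to show why the segmentation comes out the same either way:

    /* Given a known unit boundary i > 0, return the start of the unit
       that ends at i.  A unit is at most four bytes, so we never look
       back further than that: step back over trailing bytes to the
       nearest possible lead byte and check that it really does decode
       to a unit ending exactly at i; otherwise the byte at i - 1 is a
       unit on its own.                                                 */
    size_t utf8_prev_boundary(const uint8_t *buf, size_t len, size_t i)
    {
        size_t limit = (i >= 4) ? i - 4 : 0;
        size_t s = i - 1;
        uint32_t cp;

        while (s > limit && (buf[s] & 0xC0) == 0x80)
            s--;

        if ((buf[s] & 0xC0) != 0x80
            && utf8_decode_unit(buf + s, len - s, &cp) == i - s)
            return s;
        return i - 1;
    }

Because both directions end up asking utf8_decode_unit() the same question, a 
buffer like <E1 80 80 80> splits into <E1 80 80> plus a lone trailing byte 
whether you walk it forwards or backwards.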

Kind regards,

Alastair.

--
http://alastairs-place.net

