Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham via Unicode Wed, 31 May 2017 13:12:12 -0700

On Wed, 31 May 2017 17:43:08 +0000
Shawn Steele via Unicode <[email protected]> wrote:


> There also appears to be a special weight given to
> non-minimally-encoded sequences.  It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:

<snip>

> I do not understand the energy being invested in a case that
> shouldn't happen, especially in a case that is a subset of all the
> other bad cases that could happen.

That's not the motivation for my using a structurally based approach.
I want to expend as little energy as possible, both in thought (Keep
It Simple, Stupid) and in machine cycles, in catering for these
overlong/non-scalar value cases. I have to cater for indisputably
illegal truncated sequences, but for the rest of it I optimise for the
conformant case. If I'm extracting scalar values, I calculate the
scalar value and then check that it's legal. If I'm advancing through a
string, I just advance by the requisite number of trailing bytes.
UTF-8 is simple in concept, and I try to follow that simplicity.  A
state machine overcomplicates it.
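
A minimal sketch of the structural approach described above, in Python. It is an illustration, not code from the original message: the names `trail_count`, `extract` and `advance` are hypothetical, and the choice to emit a single U+FFFD for a structurally complete but illegal sequence is an assumption about how such a decoder might behave.

```python
REPLACEMENT = 0xFFFD

# Smallest scalar value that legitimately needs 0, 1, 2 or 3 trailing bytes.
MIN_FOR_TRAIL = (0x0, 0x80, 0x800, 0x10000)


def trail_count(lead):
    """Trailing bytes implied by a lead byte, or -1 if it cannot lead."""
    if lead < 0x80:
        return 0
    if lead < 0xC0:
        return -1  # stray trailing byte 0x80..0xBF
    if lead < 0xE0:
        return 1
    if lead < 0xF0:
        return 2
    if lead < 0xF8:
        return 3
    return -1  # 0xF8..0xFF never appear in UTF-8


def extract(data, i):
    """Return (scalar value or U+FFFD, index of the next sequence)."""
    n = trail_count(data[i])
    if n < 0:
        return REPLACEMENT, i + 1
    value = data[i] & (0x7F if n == 0 else 0x3F >> n)
    for k in range(1, n + 1):
        # Truncated sequences are the one error checked while reading.
        if i + k >= len(data) or (data[i + k] & 0xC0) != 0x80:
            return REPLACEMENT, i + k
        value = (value << 6) | (data[i + k] & 0x3F)
    # Calculate first, then check legality: overlong, surrogate, too large.
    legal = (MIN_FOR_TRAIL[n] <= value <= 0x10FFFF
             and not 0xD800 <= value <= 0xDFFF)
    return (value if legal else REPLACEMENT), i + n + 1


def advance(data, i):
    """Skip one sequence: the lead byte plus the trailing bytes present."""
    n = trail_count(data[i])
    j = i + 1
    while n > 0 and j < len(data) and (data[j] & 0xC0) == 0x80:
        j += 1
        n -= 1
    return j
```

Under these assumptions, `extract(b"\xc0\x80", 0)` consumes both bytes of the overlong form and yields one U+FFFD, whereas `extract(b"\xe2\x82A", 0)` stops at the byte that is not a trailing byte, so the `A` is decoded on the next call.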

Moreover, if I want to handle CESU-8 or U+0000 as opposed to a sentinel
null, it is easy to add special case logic to a scalar value extractor.
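
As an illustration of that kind of special-case logic, here is a hedged Python sketch. It assumes that "U+0000 as opposed to a sentinel null" refers to the convention of encoding U+0000 as the two bytes C0 80, and that CESU-8 support means joining a pair of three-byte surrogate encodings into one supplementary code point. The names `raw_extract` and `extract_ext` and both keyword flags are hypothetical.

```python
REPLACEMENT = 0xFFFD
MIN_FOR_TRAIL = (0x0, 0x80, 0x800, 0x10000)


def raw_extract(data, i):
    """Structural decode only: (value or None, trail count, next index)."""
    lead = data[i]
    if lead < 0x80:
        return lead, 0, i + 1
    if lead < 0xC0 or lead > 0xF7:
        return None, 0, i + 1  # cannot start a sequence
    n = 1 if lead < 0xE0 else 2 if lead < 0xF0 else 3
    value = lead & (0x3F >> n)
    for k in range(1, n + 1):
        if i + k >= len(data) or (data[i + k] & 0xC0) != 0x80:
            return None, n, i + k  # truncated
        value = (value << 6) | (data[i + k] & 0x3F)
    return value, n, i + n + 1


def extract_ext(data, i, nul_as_c0_80=True, cesu8=True):
    """Scalar value extractor with two optional special cases."""
    value, n, j = raw_extract(data, i)
    if value is None:
        return REPLACEMENT, j
    # Special case 1: accept the overlong C0 80 as U+0000.
    if nul_as_c0_80 and n == 1 and value == 0:
        return 0, j
    # Special case 2: CESU-8 pair, high then low surrogate, 3 bytes each.
    if cesu8 and n == 2 and 0xD800 <= value <= 0xDBFF and j < len(data):
        low, n2, j2 = raw_extract(data, j)
        if low is not None and n2 == 2 and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00), j2
    # Ordinary legality check on the calculated value.
    legal = (MIN_FOR_TRAIL[n] <= value <= 0x10FFFF
             and not 0xD800 <= value <= 0xDFFF)
    return (value if legal else REPLACEMENT), j
```

With both flags switched off the function reduces to a plain calculate-then-check extractor, which is the point of the design: the special cases sit as two short tests between the calculation and the legality check.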

> 
> -Shawn 
>
