On Wed, 31 May 2017 17:43:08 +0000 Shawn Steele via Unicode <unicode@unicode.org> wrote:
> There also appears to be a special weight given to
> non-minimally-encoded sequences. It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:

<snip>

> I do not understand the energy being invested in a case that
> shouldn't happen, especially in a case that is a subset of all the
> other bad cases that could happen.

That's not the motivation for my using a structurally based approach.
I want to expend as little energy as possible, both in thought (Keep It
Simple, Stupid) and in machine cycles, in catering for these
overlong/non-scalar value cases. I have to cater for indisputably
illegal truncated sequences, but for the rest of it I optimise for the
conformant case. If I'm extracting scalar values, I calculate the
scalar value and then check that it's legal. If I'm advancing through
a string, I just advance by the requisite number of trailing bytes.

UTF-8 is simple in concept, and I try to follow that simplicity. A
state machine overcomplicates it. Moreover, if I want to handle CESU-8
or U+0000 as opposed to a sentinel null, it is easy to add special-case
logic to a scalar value extractor.

>
> -Shawn
>
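
For readers who want something concrete, here is a minimal sketch in C of the structural approach described above: read the length from the lead byte, accumulate the value, and only then check that the result is legal. The names `utf8_trail_count`, `utf8_decode` and `utf8_advance` are hypothetical and are not taken from the original message, and the return conventions (0 for an ill-formed sequence) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of trailing bytes implied by a lead byte,
   or -1 if the byte cannot start a sequence. */
static int utf8_trail_count(unsigned char lead)
{
    if (lead < 0x80) return 0;   /* 0xxxxxxx: ASCII */
    if (lead < 0xC0) return -1;  /* 10xxxxxx: stray trailing byte */
    if (lead < 0xE0) return 1;   /* 110xxxxx */
    if (lead < 0xF0) return 2;   /* 1110xxxx */
    if (lead < 0xF8) return 3;   /* 11110xxx */
    return -1;                   /* 0xF8..0xFF never start a sequence */
}

/* Hypothetical extractor: decodes one scalar value from s[0..len).
   Returns the number of bytes consumed (1..4) and stores the value in
   *out, or returns 0 if the sequence is truncated or not legal. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest value that legitimately needs n trailing bytes. */
    static const uint32_t min_value[4] = { 0x0, 0x80, 0x800, 0x10000 };

    assert(s != NULL && out != NULL);
    if (len == 0) return 0;

    int n = utf8_trail_count(s[0]);
    if (n < 0 || (size_t)n >= len) return 0;  /* bad lead, or cut off by end of buffer */

    /* Structural step: strip the length marker, then append 6 bits per trailing byte. */
    uint32_t v = (n == 0) ? s[0] : (uint32_t)(s[0] & (0x3F >> n));
    for (int i = 1; i <= n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* truncated sequence */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Legality is checked only after the value has been computed.
       Special cases (e.g. accepting C0 80 as U+0000, or pairing
       CESU-8 surrogates) would be added around these checks. */
    if (v < min_value[n]) return 0;            /* overlong form */
    if (v >= 0xD800 && v <= 0xDFFF) return 0;  /* surrogate, not a scalar value */
    if (v > 0x10FFFF) return 0;                /* beyond the Unicode range */

    *out = v;
    return (size_t)n + 1;
}

/* Hypothetical advance: skips one sequence without validating its value.
   It stops early only if the sequence is truncated. */
static size_t utf8_advance(const unsigned char *s, size_t len)
{
    if (len == 0) return 0;
    int n = utf8_trail_count(s[0]);
    size_t i = 1;
    while (n-- > 0 && i < len && (s[i] & 0xC0) == 0x80) i++;
    return i;
}
```

In this sketch the overlong pair C0 80 is rejected by `utf8_decode` but stepped over as a two-byte unit by `utf8_advance`, which is one way to read "optimise for the conformant case": only the extractor pays for the legality checks.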