2017-05-16 20:50 GMT+02:00 Shawn Steele <[email protected]>:
> But why change a recommendation just because it “feels like”. As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (eg: they could use a single FFFD).
>
> If the recommendation is truly that meaningless or arbitrary, then we just
> get into silly discussions of “better” that nobody can really answer.
>
> Alternatively, how about “one or more FFFDs?” for the recommendation?
>
> To me it feels very odd to perhaps require writing extra code to detect an
> illegal case. The “best practice” here should maybe be “one or more FFFDs,
> whatever makes your code faster”.

Faster is OK, provided this does not break other uses, notably random access within strings. UTF-8 is designed so that you can search backward over a limited number of bytes (at most 3) to find the leading byte, and then check its value:

- If no leading byte is found, return to the initial position and make the access return U+FFFD to signal the error at that position: this trailing byte is part of an ill-formed sequence. For coherence, any further trailing bytes found after it must **also** return U+FFFD, because those bytes may equally be reached by random access.
- If a leading byte is found but the number of trailing bytes after it does not match what it announces, return to the initial random position, which also returns U+FFFD. This means that the leading byte itself (part of the ill-formed sequence) must return its own separate U+FFFD, given that each trailing byte after it returns U+FFFD on its own when accessed.
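To make the backward search above concrete, here is a minimal Python sketch. The names `code_point_at`, `_expected_length` and `_decode_at` are made up for this illustration and belong to no existing library, and the strict decoding of each candidate sequence is simply delegated to Python's built-in UTF-8 codec.

```python
REPLACEMENT = 0xFFFD

def _expected_length(lead):
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # trailing byte (0x80..0xBF) or a byte that never occurs in UTF-8

def _decode_at(buf, start):
    """Return (code point, length) for a well-formed sequence at start, else None."""
    n = _expected_length(buf[start])
    if n == 0 or start + n > len(buf):
        return None
    try:
        # Strict decoding also rejects overlong forms and encoded surrogates.
        text = buf[start:start + n].decode("utf-8")
    except UnicodeDecodeError:
        return None
    return ord(text), n

def code_point_at(buf, i):
    """Value seen by random access at byte offset i of a bytes object."""
    # Search backward over at most 3 bytes for a byte that is not a trailing byte.
    start = i
    while start > 0 and i - start < 3 and 0x80 <= buf[start] <= 0xBF:
        start -= 1
    decoded = _decode_at(buf, start)
    if decoded is not None and start + decoded[1] > i:
        return decoded[0]  # offset i lies inside a well-formed sequence
    return REPLACEMENT     # offset i is one byte of an ill-formed sequence
```

With the input `61 E2 82 AC E2 82 62` (the letter "a", a euro sign, a truncated euro sign, the letter "b"), offsets 1 to 3 all report U+20AC, while offsets 4 and 5 each report their own U+FFFD, whichever of them is accessed first.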
If we want coherent decoding with text-handling primitives that allow random access into encoded sequences, there is no other choice than treating EACH byte of the ill-formed sequence as an individual error mapped to the same replacement code point. That is U+FFFD if that is what is chosen, but these APIs could just as well specify another replacement character, or could return a non-code-point if the API return value is not restricted to valid code points. For example, the replacement could be a negative value whose absolute value matches the invalid code unit, or some other value outside the valid range for code points with scalar values: isolated surrogates in UTF-16 could be returned as is, or made negative either by negating them or by setting (OR'ing) the most significant bit of the return value.

The problem arises when you need to store the replacement values and the internal backing store is limited to 16-bit or 8-bit code units. This internal backing store may use its own internal extension of the standard UTFs, including the possibility of encoding NULLs as C0,80 (as Java does with the "modified UTF-8" internal encoding used in its compiled binary classes and serializations), or of using isolated trailing surrogates to store ill-formed UTF-8 input by OR'ing the offending bytes with 0xDC00; each such unit is then returned as a code point with no valid scalar value. For internally representing ill-formed UTF-16 sequences, there's no need to change anything. For internally representing ill-formed UTF-32 sequences (in fact limited to one 32-bit code unit) in a 16-bit internal backing store, you may need to store three 16-bit values (three isolated trailing surrogates). For internally representing ill-formed UTF-32 in an 8-bit backing store, you could use 0xC1 followed by five trailing bytes (each one storing 7 bits of the initial ill-formed code unit from the UTF-32 input).
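As an illustration of the "OR with 0xDC00" idea, Python's standard `surrogateescape` error handler happens to do exactly this for undecodable bytes: each one becomes an isolated trailing surrogate in the range U+DC80..U+DCFF. The sketch below relies on it. The function names are invented for the example, a Python `str` only stands in for a 16-bit backing store, and returning negative values is just one of the conventions mentioned above.

```python
def to_backing_store(raw: bytes) -> str:
    """Decode UTF-8, keeping each ill-formed byte as the code unit 0xDC00 | byte."""
    return raw.decode("utf-8", errors="surrogateescape")

def from_backing_store(stored: str) -> bytes:
    """Recover the original bytes, ill-formed ones included."""
    return stored.encode("utf-8", errors="surrogateescape")

def api_values(stored: str) -> list:
    """Valid code points as is; each escaped byte as the negated byte value."""
    values = []
    for ch in stored:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            values.append(-(cp & 0xFF))  # outside the code point range: an error marker
        else:
            values.append(cp)
    return values
```

For the input bytes `61 E2 82 62`, the store holds `0061 DCE2 DC82 0062`, and `from_backing_store` returns the original four bytes unchanged. Each byte of the ill-formed sequence keeps its own slot, which is what random access needs.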
What you do in the internal backing store will not be exposed by your API, which will just return either valid code points with valid scalar values, or values outside the two valid subranges (so these could be negative values, or isolated trailing surrogates). That backing store can also substitute some valid input that causes problems (such as NULLs) using 0xC0 plus another byte. That sequence is never exposed by your API, which will still return the expected code points, with the minor caveat that the total number of returned code points will not match the actual size allocated for the internal backing store (applications using that API won't even need to know how the text is internally represented). In other words: any private extensions are possible internally, and they can be isolated behind a black-box API that remains free to choose how to represent the input text (it may as well use a zlib-compressed backing store, or some stateless Huffman compression based on a static statistics table configured and stored elsewhere, initialized when you first instantiate your API).
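A small Python sketch of such a black box, under stated assumptions: the class `OpaqueText` and its methods are hypothetical, and the only private extension shown is storing U+0000 as C0 80. The replacement is unambiguous because the byte C0 never occurs in well-formed UTF-8, and the byte 00 occurs only as U+0000.

```python
class OpaqueText:
    """Hypothetical black-box string: callers only ever see code points."""

    def __init__(self, text: str):
        # Private extension: U+0000 is stored as the two bytes C0 80,
        # so the backing store never contains a zero byte.
        self._store = text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

    def code_points(self) -> list:
        """Return the original code points, undoing the private extension."""
        standard = self._store.replace(b"\xc0\x80", b"\x00")
        return [ord(ch) for ch in standard.decode("utf-8")]

    def stored_size(self) -> int:
        """Size of the backing store in bytes (normally of no interest to callers)."""
        return len(self._store)
```

For the three code points of `"a\x00b"`, `code_points()` returns three values while the store occupies four bytes, which is the size mismatch described above.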

