On 31 May 2017, at 20:42, Shawn Steele via Unicode <unicode@unicode.org> wrote:
>
>> And *that* is what the specification says. The whole problem here is that
>> someone elevated one choice to the status of “best practice”, and it’s a
>> choice that some of us don’t think *should* be considered best practice.
>
>> Perhaps “best practice” should simply be altered to say that you *clearly
>> document* your behavior in the case of invalid UTF-8 sequences, and that
>> code should not rely on the number of U+FFFDs generated, rather than
>> suggesting a behaviour?
>
> That's what I've been suggesting.
>
> I think we could maybe go a little further though:
>
> * Best practice is clearly not to depend on the # of U+FFFDs generated by
>   another component/app. Clearly that can't be relied upon, so I think
>   everyone can agree with that.
> * I think encouraging documentation of behavior is cool, though there are
>   probably low priority bugs and people don't like to read the docs in that
>   detail, so I wouldn't expect very much from that.
> * As far as I can tell, there are two (maybe three) sane approaches to this
>   problem:
>     * Either a "maximal" emission of one U+FFFD for every byte that exists
>       outside of a good sequence
>     * Or a "minimal" version that presumes the lead byte was counting trail
>       bytes correctly even if the resulting sequence was invalid. In that
>       case just use one U+FFFD.
>     * And (maybe, I haven't heard folks arguing for this one) emit one
>       U+FFFD at the first garbage byte and then ignore the input until valid
>       data starts showing up again. (So you could have 1 U+FFFD for a string
>       of a hundred garbage bytes as long as there weren't any valid sequences
>       within that group).
> * I'd be happy if the best practice encouraged one of those two (or maybe
>   three) approaches. I think an approach that called rand() to see how many
>   U+FFFDs to emit when it encountered bad data is fair to discourage.

Agreed.

Kind regards,

Alastair.

--
http://alastairs-place.net
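
For anyone who wants to compare the two (maybe three) approaches quoted above side by side, here is a rough Python sketch. It is only an illustration, not anything taken from the specification or from an existing decoder: the function names (`decode_lossy`, `declared_length`, `good_sequence_length`) and the policy label "collapse" are made up for this example. It also assumes that the "minimal" policy swallows the lead byte plus only those following bytes that really are continuation bytes, up to the count the lead byte declares; other readings of "minimal" are possible.

```python
REPLACEMENT = "\ufffd"


def declared_length(lead: int) -> int:
    """Sequence length a lead byte claims, or 0 if the byte cannot be a lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # continuation byte (0x80-0xBF) or 0xF8-0xFF


def good_sequence_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at i, or 0 if none."""
    n = declared_length(data[i])
    if n == 0 or i + n > len(data):
        return 0
    try:
        # Python's strict decoder rejects overlongs, surrogates and > U+10FFFF.
        data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def decode_lossy(data: bytes, policy: str) -> str:
    """Decode UTF-8, replacing bad input according to a hypothetical policy.

    "maximal":  one U+FFFD per byte outside a good sequence.
    "minimal":  trust the lead byte's count; one U+FFFD per bad sequence.
    "collapse": one U+FFFD per run of garbage, however long.
    """
    if policy not in ("maximal", "minimal", "collapse"):
        raise ValueError(f"unknown policy: {policy!r}")
    out = []
    i = 0
    in_garbage = False
    while i < len(data):
        n = good_sequence_length(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            in_garbage = False
            continue
        if policy == "maximal":
            out.append(REPLACEMENT)
            i += 1
        elif policy == "minimal":
            out.append(REPLACEMENT)
            want = max(declared_length(data[i]), 1)
            i += 1
            consumed = 1
            # Assumption: only genuine continuation bytes are swallowed.
            while consumed < want and i < len(data) and 0x80 <= data[i] <= 0xBF:
                i += 1
                consumed += 1
        else:  # "collapse"
            if not in_garbage:
                out.append(REPLACEMENT)
            in_garbage = True
            i += 1
    return "".join(out)
```

With the input `b"A\xe0\x80\x80\xffB"` (an overlong three-byte sequence followed by a stray 0xFF), the three policies emit four, two and one U+FFFD respectively, which is exactly why code should not depend on the count produced by another component.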