> And *that* is what the specification says. The whole problem here is that
> someone elevated one choice to the status of “best practice”, and it’s a
> choice that some of us don’t think *should* be considered best practice.
> Perhaps “best practice” should simply be altered to say that you *clearly
> document* your behavior in the case of invalid UTF-8 sequences, and that
> code should not rely on the number of U+FFFDs generated, rather than
> suggesting a behaviour?
That's what I've been suggesting.
I think we could maybe go a little further though:
* Best practice is clearly not to depend on the number of U+FFFDs generated by
another component/app. That can't be relied upon, so I think everyone can
agree with that.
* I think encouraging documentation of the behavior is cool, though those
would probably be low-priority bugs and people don't like to read the docs in
that detail, so I wouldn't expect very much from that.
* As far as I can tell, there are two (maybe three) sane approaches to this
problem:
* Either a "maximal" emission of one U+FFFD for every byte that exists
outside of a good sequence
* Or a "minimal" version that presumes the lead byte was counting trail
bytes correctly even if the resulting sequence was invalid. In that case just
use one U+FFFD.
* And (maybe, I haven't heard folks arguing for this one) emit one
U+FFFD at the first garbage byte and then ignore the input until valid data
starts showing up again. (So you could have 1 U+FFFD for a string of a hundred
garbage bytes as long as there weren't any valid sequences within that group).
* I'd be happy if the best practice encouraged one of those two (or maybe
three) approaches. I think an approach that called rand() to see how many
U+FFFDs to emit when it encountered bad data is fair to discourage.
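For concreteness, the three policies above could be sketched roughly as below. This is a rough Python sketch under my own assumptions (the function names, the lead-byte table, and the resync-one-byte-at-a-time strategy are illustrative, not anyone's actual implementation):

```python
REPLACEMENT = "\uFFFD"

def trail_count(lead: int) -> int:
    """Trail bytes the lead byte claims to have (0 if it isn't a valid lead)."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return 0  # ASCII, stray continuation, or invalid lead (C0/C1, F5-FF)

def decode_maximal(data: bytes) -> str:
    """'Maximal' policy: one U+FFFD for every byte outside a good sequence."""
    out, i = [], 0
    while i < len(data):
        n = trail_count(data[i])
        try:
            out.append(data[i : i + n + 1].decode("utf-8"))
            i += n + 1
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
            i += 1  # advance one byte, so each bad byte gets its own U+FFFD
    return "".join(out)

def decode_minimal(data: bytes) -> str:
    """'Minimal' policy: trust the lead byte's claimed length; one U+FFFD
    per claimed sequence even if the trail bytes turn out to be invalid."""
    out, i = [], 0
    while i < len(data):
        n = trail_count(data[i])
        try:
            out.append(data[i : i + n + 1].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i += n + 1  # skip the whole claimed sequence either way
    return "".join(out)

def decode_skip_run(data: bytes) -> str:
    """Third policy: one U+FFFD per run of garbage, resuming at valid data."""
    out, i, in_garbage = [], 0, False
    while i < len(data):
        n = trail_count(data[i])
        try:
            out.append(data[i : i + n + 1].decode("utf-8"))
            i += n + 1
            in_garbage = False
        except UnicodeDecodeError:
            if not in_garbage:
                out.append(REPLACEMENT)
                in_garbage = True
            i += 1
    return "".join(out)
```

On the same three garbage bytes `b"\xE0\x80\x80"`, the maximal sketch emits three U+FFFDs, while the minimal and skip-run sketches each emit one; the policies only diverge on ill-formed input, since all three decode well-formed UTF-8 identically.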
-Shawn