Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode Thu, 01 Jun 2017 01:19:14 -0700

On 31 May 2017, at 20:42, Shawn Steele via Unicode <unicode@unicode.org> wrote:
> 
>> And *that* is what the specification says.  The whole problem here is that
>> someone elevated one choice to the status of “best practice”, and it’s a
>> choice that some of us don’t think *should* be considered best practice.
> 
>> Perhaps “best practice” should simply be altered to say that you *clearly
>> document* your behavior in the case of invalid UTF-8 sequences, and that
>> code should not rely on the number of U+FFFDs generated, rather than
>> suggesting a behaviour?
> 
> That's what I've been suggesting.
> 
> I think we could maybe go a little further though:
> 
> * Best practice is clearly not to depend on the number of U+FFFDs generated
> by another component/app.  Clearly that can't be relied upon, so I think 
> everyone can agree with that.
> * I think encouraging documentation of behavior is cool, though there are 
> probably low priority bugs and people don't like to read the docs in that 
> detail, so I wouldn't expect very much from that.
> * As far as I can tell, there are two (maybe three) sane approaches to this 
> problem:
>       * Either a "maximal" emission of one U+FFFD for every byte that exists
>         outside of a good sequence
>       * Or a "minimal" version that presumes the lead byte was counting trail
>         bytes correctly even if the resulting sequence was invalid.  In that
>         case just use one U+FFFD.
>       * And (maybe, I haven't heard folks arguing for this one) emit one
>         U+FFFD at the first garbage byte and then ignore the input until
>         valid data starts showing up again.  (So you could have 1 U+FFFD for
>         a string of a hundred garbage bytes as long as there weren't any
>         valid sequences within that group).
> * I'd be happy if the best practice encouraged one of those two (or maybe 
> three) approaches.  I think an approach that called rand() to see how many 
> U+FFFDs to emit when it encountered bad data is fair to discourage.


Agreed.
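
For concreteness, here is a rough Python sketch of the three approaches
listed above.  It is only an illustration, not anything taken from the
standard or from an existing decoder: the names decode(), expected_length()
and valid_sequence_at(), and the policy labels "maximal", "minimal" and
"collapse", are made up for this example.  The "minimal" branch is one
possible reading of "presumes the lead byte was counting trail bytes
correctly": it consumes the lead byte plus as many following continuation
bytes as the lead byte announced.

```python
from typing import Optional

REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # continuation byte (0x80-0xBF) or 0xF8-0xFF


def valid_sequence_at(data: bytes, i: int) -> Optional[str]:
    """Return the character encoded at data[i:], or None if it is ill-formed."""
    n = expected_length(data[i])
    if n == 0 or i + n > len(data):
        return None
    try:
        # Python's strict decoder rejects overlongs, surrogates and > U+10FFFF.
        return data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode(data: bytes, policy: str) -> str:
    """Decode UTF-8, replacing bad data according to a hypothetical policy name."""
    out = []
    i = 0
    in_garbage = False  # only used by the "collapse" policy
    while i < len(data):
        ch = valid_sequence_at(data, i)
        if ch is not None:
            out.append(ch)
            i += expected_length(data[i])
            in_garbage = False
        elif policy == "maximal":
            # One U+FFFD for every byte outside a good sequence.
            out.append(REPLACEMENT)
            i += 1
        elif policy == "minimal":
            # Assumed reading: trust the lead byte's count and swallow the
            # continuation bytes that follow it, emitting a single U+FFFD.
            n = expected_length(data[i])
            j = i + 1
            while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
                j += 1
            out.append(REPLACEMENT)
            i = j
        elif policy == "collapse":
            # One U+FFFD at the first garbage byte, then skip until valid data.
            if not in_garbage:
                out.append(REPLACEMENT)
            in_garbage = True
            i += 1
        else:
            raise ValueError("unknown policy: " + policy)
    return "".join(out)
```

With this sketch, the overlong sequence b"A\xc0\xafB" comes out with two
U+FFFDs under "maximal" and one under the other two policies, while a run of
five stray continuation bytes gives five U+FFFDs under both "maximal" and
"minimal" and only one under "collapse".  Well-formed input is unaffected by
the choice of policy.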

Kind regards,

Alastair.

--
http://alastairs-place.net
