On Mon, 19 Dec 2016 20:54:31 -0700 Doug Ewell <d...@ewellic.org> wrote:
> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in assessing how much of the document is suspect.

How many bytes are invalid in the sequence F0 30 A0 B0?  There might
just be one bit error in the data stream.

The chief advantage of collapsing comes in the simplicity of the
decoding logic.  The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8.  Obviously one also
has to check that the requisite continuation bytes are present.
Arguments then come down to the use or otherwise of library functions
and the number of error-reporting mechanisms to be used.

Richard.
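
For illustration, the decoding logic described above (read the requisite
continuation bytes, build the codepoint value, then check that the value
is allowed) might be sketched in Python roughly as follows.  This is only
a sketch under stated assumptions: the function name
decode_utf8_collapsing is hypothetical, and the choice to resume decoding
at the offending byte when a continuation byte is missing is one possible
policy, not something the discussion settles.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_collapsing(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per rejected sequence (hypothetical helper)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # ASCII byte: a complete character on its own.
            out.append(chr(lead))
            i += 1
            continue
        # Work out how many continuation bytes the lead byte calls for,
        # the payload bits it carries, and the smallest value that is
        # not an overlong encoding for that length.
        if 0xC0 <= lead <= 0xDF:
            need, cp, minimum = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            need, cp, minimum = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            need, cp, minimum = 3, lead & 0x07, 0x10000
        else:
            # Stray continuation byte (80..BF) or F8..FF: never a valid lead.
            out.append(REPLACEMENT)
            i += 1
            continue
        i += 1
        # Read the requisite continuation bytes, checking each is present.
        complete = True
        for _ in range(need):
            if i >= n or (data[i] & 0xC0) != 0x80:
                # Assumed policy: stop here and resume at the offending byte.
                complete = False
                break
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
        # Only now check the assembled value: not overlong, not a
        # surrogate, and not beyond U+10FFFF.
        allowed = (
            complete
            and minimum <= cp <= 0x10FFFF
            and not (0xD800 <= cp <= 0xDFFF)
        )
        out.append(chr(cp) if allowed else REPLACEMENT)
    return "".join(out)
```

Under these assumptions, F0 30 A0 B0 decodes to U+FFFD, "0", U+FFFD,
U+FFFD, while a structurally complete but disallowed sequence such as
ED A0 80 (a surrogate) collapses to a single U+FFFD.  Other policies give
different counts for the same input, which is the point at issue.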