On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen <[email protected]> wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
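
For illustration, here is a minimal Python 3 sketch of how the number of REPLACEMENT CHARACTERs per bogus sequence can be counted. The byte sequences below are examples picked for this sketch, not the contents of the test file, and the helper name count_replacements is made up here. It assumes a Python 3 version whose UTF-8 decoder follows the currently stated best practice.

```python
# Count U+FFFD characters produced when decoding ill-formed UTF-8.
# The sample byte sequences are illustrative assumptions, not taken
# from broken-utf-8.html.

REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with replacement and count the U+FFFD characters."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


samples = {
    "overlong 2-byte form (C0 80)": b"\xc0\x80",
    "overlong 3-byte form (E0 80 80)": b"\xe0\x80\x80",
    "encoded surrogate (ED A0 80)": b"\xed\xa0\x80",
    "overlong 4-byte form (F0 80 80 80)": b"\xf0\x80\x80\x80",
}

for label, data in samples.items():
    # For these particular inputs, no byte can start or continue a
    # well-formed sequence, so each byte is replaced on its own.
    print(f"{label}: {len(data)} bytes, {count_replacements(data)} U+FFFD")
```

For these inputs the count matches the byte count. A truncated but otherwise valid prefix is a different case and is not covered by this sketch.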

Testing with that file, Python 3 and OpenJDK 8 agree with the
currently-specced best practice, too. I expect there to be other
well-known implementations that comply with the currently-specced
best practice, so the rationale to change the stated best practice
would have to be very strong (as in: a security problem with the
currently-stated best practice) for a change to be appropriate.

-- 
Henri Sivonen
[email protected]
https://hsivonen.fi/

