On 1/4/10 11:44 AM, Julian Reschke wrote:
This happens to more or less match "decoding as ISO-8859-1", but not
quite.
...

Not quite?

More precisely, it happens to not quite match what browsers call ISO-8859-1, which is actually Windows-1252. In particular, ISO-8859-1 doesn't define the behavior of the 0x7F-0x9F range, whereas byte inflation does (mapping the range to various Unicode control characters) and Windows-1252 does as well, in a different way (mapping the range to various printable Unicode characters).
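
A minimal Python sketch of the difference, for illustration only. It assumes Python's "latin-1" codec as a stand-in for byte inflation and its "cp1252" codec as a stand-in for what browsers do; the sample bytes are made up, and they avoid the few values (such as 0x81) that Python's cp1252 codec rejects.

```python
# Three sample bytes from the range where the two decodings disagree.
raw = bytes([0x80, 0x93, 0x94])

# Byte inflation: each byte becomes the code point with the same value.
inflated = "".join(chr(b) for b in raw)

# Windows-1252: the same bytes become printable characters.
as_cp1252 = raw.decode("cp1252")

# Python's "latin-1" codec behaves like byte inflation for every byte value.
as_latin1 = raw.decode("latin-1")

print([hex(ord(c)) for c in inflated])   # C1 control code points
print([hex(ord(c)) for c in as_cp1252])  # euro sign and curly quotes
```

Outside this range (and for 0x7F itself) the two decodings agree, which is why the mismatch is easy to miss.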

It at least preserves all the information that was there and would allow
a caller to re-decode as UTF-8 as a separate step.

Yep.
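
A rough sketch of that two-step re-decode in Python, assuming the header value was sent as UTF-8 and the API handed back a byte-inflated string. The sample value is hypothetical, and "latin-1" is again used as a stand-in for byte inflation.

```python
# Hypothetical header value, sent on the wire as UTF-8.
header_bytes = "caf\u00e9".encode("utf-8")        # b'caf\xc3\xa9'

# What a byte-inflating API would return: one character per byte.
inflated = header_bytes.decode("latin-1")

# Separate step by the caller: undo the inflation, then decode as UTF-8.
recovered = inflated.encode("latin-1").decode("utf-8")

print(len(inflated), len(recovered))  # 5 characters vs. 4 characters
```

The round trip works because byte inflation is a one-to-one mapping between byte values and U+0000-U+00FF, so nothing is lost on the way in.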

Right now there is no interoperable encoding, so the best thing to do in
APIs that use character sequences instead of octets is to preserve as
much information as possible.

That seems reasonable...

It would be nice if we could find out whether anybody relies on the
current implementation. Maybe switch it back to byte inflation in
Mozilla trunk?

Mozilla trunk already does byte _inflation_ when converting from header bytes into a JavaScript string. I assume you meant to convert JavaScript strings into header bytes by dropping the high byte of each 16-bit code unit. However, that fails the "preserve as much information as possible" test... In particular, as soon as any Unicode character outside the U+0000-U+00FF range is used, byte-dropping loses information.
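
To illustrate the loss, here is a small Python sketch. The helper name drop_high_byte is made up for this example, and going through UTF-16LE is just one way to get at the 16-bit code units that a JavaScript string would hold.

```python
def drop_high_byte(s: str) -> bytes:
    """Keep only the low byte of each 16-bit code unit (hypothetical helper)."""
    utf16 = s.encode("utf-16-le")  # each code unit is stored as low byte, high byte
    return utf16[::2]              # take the low bytes, discard the high bytes

# Characters within U+0000-U+00FF survive a round trip through byte inflation.
latin = "caf\u00e9"
assert drop_high_byte(latin).decode("latin-1") == latin

# Outside that range, distinct characters collapse onto the same byte:
# U+20AC (euro sign) and U+00AC (not sign) both become 0xAC.
print(drop_high_byte("\u20ac"), drop_high_byte("\u00ac"))
```

Once two different strings produce the same bytes, the receiver has no way to tell which one was meant.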

-Boris
