On 2012/11/17 12:54, Buck Golemon wrote:
On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <d...@ewellic.org> wrote:

Buck Golemon wrote:

  Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
to map it to the equally-non-semantic U+81 ?

U+0081 (there are always at least four digits in this notation) just by chance doesn't have any definition. But if we take the next of the "holes" in windows-1252, 0x8D, we get "REVERSE LINE FEED". This isn't exactly non-semantic (although of course browsers and quite a bit of other software ignore that meaning).
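
As a rough illustration, the unassigned positions of cp1252 in the 0x80..0x9F range can be listed with a few lines of Python 3. This assumes Python's built-in "cp1252" codec, which is just one implementation and is stricter than what browsers do; the helper name undefined_bytes is made up for this sketch.

```python
def undefined_bytes(encoding, start=0x80, stop=0xA0):
    """Return the byte values in [start, stop) that the codec refuses to decode."""
    holes = []
    for value in range(start, stop):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The codec has no mapping for this byte value.
            holes.append(value)
    return holes

cp1252_holes = undefined_bytes("cp1252")
print([hex(v) for v in cp1252_holes])
```

With CPython's codec this should report the five holes 0x81, 0x8d, 0x8f, 0x90 and 0x9d.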


Why do you make this conditional on targeting html5?

To me, replacement and error is out because it means the system loses data
or completely fails where it used to succeed.

There are cases where one wants to avoid as many failures as possible, at the cost of GIGO (garbage in, garbage out). Browsers are definitely in that category.

There are other cases where one wants to catch garbage early, and not let it pollute the rest of the data.


Currently there's no reasonable way for me to implement the U+0081 option
other than inventing a new "cp1252+latin1" codec, which seems undesirable.

Well, the above two cases cannot be met with one and the same codec (unless, of course, the codec has additional options that allow switching between one behavior and the other).
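
One way such an option could look, sketched in Python 3: the stock "cp1252" codec stays strict by default, and a separately registered error handler gives the lenient, GIGO-style behavior on request. The handler name "c1-passthrough" and the function c1_passthrough are invented for this example; codecs.register_error and the errors= argument are standard library features.

```python
import codecs

def c1_passthrough(exc):
    """Map each undecodable byte to the code point with the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Resume decoding right after the offending byte(s).
        return "".join(chr(b) for b in bad), exc.end
    raise exc

# Hypothetical handler name, chosen only for this sketch.
codecs.register_error("c1-passthrough", c1_passthrough)

data = b"caf\xe9 \x80 \x81"

# Lenient: 0x81 becomes U+0081, everything else decodes as usual.
lenient = data.decode("cp1252", errors="c1-passthrough")

# Strict (the default): the same input is rejected early.
try:
    data.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The caller picks the policy per call, so the "catch garbage early" case and the "avoid failures" case share one codec without a new "cp1252+latin1" name.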


I feel like you skipped a step. The byte is 0x81 full stop. I agree that it
doesn't matter how it's defined in latin1 (also it's not defined in latin1).
The section of the unicode standard that says control codes are equal to
their unicode characters doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it
goes out of its way to talk about "8-bit" control codes.

I'd say it intends to apply to any single-byte encoding with a full C1 range, or in other words, any single-byte encoding conforming to the ISO C0/G0/C1/G1 model (the one that's used, if not defined, in ISO 2022). So that would include any encoding of the ISO-8859-X family, but not the windows-XXXX or Macintosh encodings.

In other words, the C1 range isn't just a dumping ground for cases where the conversion would fail otherwise.
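
The difference can be seen with a small check in Python 3, assuming its built-in "iso8859-1", "iso8859-2" and "cp1252" codecs reflect the usual mapping tables: the ISO-8859 codecs map bytes 0x80..0x9F to the C1 code points of the same value, while cp1252 assigns graphic characters there.

```python
# All 32 byte values of the C1 range, and the code points with the same values.
c1_bytes = bytes(range(0x80, 0xA0))
c1_code_points = "".join(chr(v) for v in range(0x80, 0xA0))

# ISO-8859-X: a full C1 range, so decoding is the identity on these values.
latin1_is_identity = c1_bytes.decode("iso8859-1") == c1_code_points
latin2_is_identity = c1_bytes.decode("iso8859-2") == c1_code_points

# windows-1252: 0x80 is a graphic character (the euro sign), not a control.
euro = b"\x80".decode("cp1252")
```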


Regards,   Martin.
