Re: cp1252 decoder implementation

Martin J. Dürst Tue, 27 Nov 2012 01:57:52 -0800

On 2012/11/17 12:54, Buck Golemon wrote:

On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <[email protected]> wrote:

Buck Golemon wrote:

Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81?

U+0081 (there are always at least four digits in this notation) just by chance doesn't have any definition. But if we take the next of the "holes" in windows-1252, 0x8D, we get "REVERSE LINE FEED". This isn't exactly non-semantic (although of course browsers and quite a bit of other software ignore that meaning).
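To make the "holes" concrete, here is a minimal sketch that probes the 0x80..0x9F range. It assumes Python 3 and its built-in "cp1252" codec, which is only one implementation of windows-1252; other decoders (browsers in particular) may treat these bytes differently. The function name cp1252_holes is made up for this illustration.

```python
def cp1252_holes():
    """Return the bytes in 0x80..0x9F that Python's cp1252 codec rejects."""
    holes = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # The codec has no mapping for this byte: it is a "hole".
            holes.append(b)
    return holes


# Show each hole next to the C1 code point with the same number.
for b in cp1252_holes():
    print(f"byte 0x{b:02X} -> same-numbered code point U+{b:04X}")
```

With that codec the list should contain 0x81 first and 0x8D second, which is the ordering the paragraph above relies on.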

Why do you make this conditional on targeting html5?

To me, replacement and error is out because it means the system loses data
or completely fails where it used to succeed.

There are cases where one wants to avoid as many failures as possible, at the cost of GIGO (garbage in, garbage out). Browsers are definitely in that category.

There are other cases where one wants to catch garbage early, and not let it pollute the rest of the data.
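For the "catch garbage early" case, a strict decode is usually enough. The sketch below assumes Python 3's built-in "cp1252" codec; first_invalid_offset is a hypothetical helper name, not part of any library.

```python
def first_invalid_offset(raw, encoding="cp1252"):
    """Return the offset of the first undecodable byte, or None if all decode."""
    try:
        raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not map.
        return exc.start
    return None


print(first_invalid_offset(b"abc\x81def"))  # 0x81 is a hole in cp1252
print(first_invalid_offset(b"abc\x80def"))  # 0x80 is the euro sign, so no error
```

A caller can reject the input at that point, before the stray byte gets stored anywhere.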

Currently there's no reasonable way for me to implement the U+0081 option
other than inventing a new "cp1252+latin1" codec, which seems undesirable.

Well, the above two cases cannot be met with one and the same codec (unless of course there are additional options that allow switching between one and the other).
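One way such an option could look, as a rough sketch rather than a recommendation: keep the stock codec and select the behaviour through the error handler. This assumes Python 3; the handler name "c1-passthrough" and the function decode_cp1252 are invented for this example. codecs.register_error is the standard hook for custom handlers.

```python
import codecs


def _c1_passthrough(exc):
    """Map each undecodable byte to the code point with the same number."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # latin-1 maps every byte 0xNN to U+00NN, so 0x81 becomes U+0081.
    return bad.decode("latin-1"), exc.end


# "c1-passthrough" is a made-up handler name for this sketch.
codecs.register_error("c1-passthrough", _c1_passthrough)


def decode_cp1252(raw, lenient=False):
    """Strict by default; with lenient=True, holes pass through as C1 controls."""
    errors = "c1-passthrough" if lenient else "strict"
    return raw.decode("cp1252", errors=errors)


print(repr(decode_cp1252(b"\x80\x81", lenient=True)))
```

The default stays strict, so garbage is still caught unless the caller explicitly asks for the GIGO behaviour.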

I feel like you skipped a step. The byte is 0x81 full stop. I agree that it
doesn't matter how it's defined in latin1 (also it's not defined in latin1).
The section of the unicode standard that says control codes are equal to
their unicode characters doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it
goes out of its way to talk about "8-bit" control codes.

I'd say it intends to apply to any single-byte encoding with a full C1 range, or in other words, any single-byte encoding conforming to the ISO C0/G0/C1/G1 model (the model that is used, if not defined, in ISO 2022). So that would include any encoding of the ISO-8859-X family but not windows-XXXX or macintosh encodings.
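The distinction can be checked mechanically. The following sketch assumes Python 3 and the codec tables it ships; maps_c1_identically is a hypothetical helper, and the encodings listed are just a sample.

```python
def maps_c1_identically(encoding):
    """True if every byte 0x80..0x9F decodes to the same-numbered code point."""
    for b in range(0x80, 0xA0):
        try:
            if bytes([b]).decode(encoding) != chr(b):
                return False
        except UnicodeDecodeError:
            # An unmapped byte means the encoding has no full C1 range.
            return False
    return True


for name in ("latin-1", "iso8859-2", "iso8859-15", "cp1252", "mac-roman"):
    print(name, maps_c1_identically(name))
```

With those tables, the ISO-8859 entries should report True, while cp1252 and mac-roman should report False because they assign graphic characters to that range.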

In other words, the C1 range isn't just a dumping ground for cases where the conversion would fail otherwise.



Regards,   Martin.
