IMO this isn't worth the effort being spent on it.  MOST encodings have all 
sorts of interesting quirks, variations, OEM- or app-specific behavior, etc.  
These are a few code points that haven't really caused much confusion, and 
other code pages are much more confusing (like the CJK ones in particular).

I'd be much happier spending effort on getting apps to UTF-8 than trying to 
resolve esoteric quirks of legacy encodings.  Even if you get that code page 
perfect, someone's gonna enter one of a bajillion characters that aren't in it 
into an HTML5 web form, and it'll turn into ? at best.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Buck Golemon
Sent: Saturday, November 17, 2012 8:35 AM
To: verd...@wanadoo.fr
Cc: Doug Ewell; unicode
Subject: Re: cp1252 decoder implementation

> So don't say that there are one-for-one equivalences.

I was just quoting this section of the standard: 
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control codes 
> and the Unicode control codes: every 7-bit (or 8-bit) control code is 
> numerically equal to its corresponding Unicode code point.

A one-to-one equivalence between bytes and Unicode code points is exactly what 
is specified here, limited to the domain of "8-bit control codes".
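
Concretely (a minimal Python sketch of my own, using only the built-in 
codecs): the latin-1 codec already realizes that identity for the whole 
C0/C1 range.

    # Every byte in 0x00-0x1F and 0x80-0x9F decodes, under latin-1
    # (ISO 8859-1), to the numerically equal code point -- the "8-bit
    # control codes" the quoted text is talking about.
    for b in list(range(0x00, 0x20)) + list(range(0x80, 0xA0)):
        assert ord(bytes([b]).decode('latin-1')) == b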

On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:
If you are thinking about "byte values", you are working at the encoding scheme 
level (in fact at another, lower level that defines a protocol presentation 
layer, e.g. "transport syntaxes" in MIME). Unicode code points are conceptually 
not an encoding scheme, just a coded character set (independent of the encoding 
scheme).
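
For example (a small Python sketch; these are just the standard encodings of 
U+00E9): the same code point has different byte representations under 
different encoding schemes.

    # U+00E9 is a single code point in the coded character set; which bytes
    # represent it depends entirely on the encoding scheme chosen.
    assert '\u00e9'.encode('utf-8')     == b'\xc3\xa9'
    assert '\u00e9'.encode('utf-16-le') == b'\xe9\x00'
    assert '\u00e9'.encode('latin-1')   == b'\xe9'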

Separate the levels of abstraction and you'll be much better off. Forget the 
apparent homonymies that exist between distinct layers of abstraction and use 
each standard for what it is designed for (including the Unicode 
"character/glyph model", which does not define an encoding scheme).

So don't say that there are one-for-one equivalences. This is wrong: an 
adaptation layer must exist between abstraction levels and between separate 
standards, but the Unicode standard does not specify those layers completely. 
The only exception is the standard UTF encoding schemes, which are just one 
possible adaptation across some of the abstraction levels, and which are not 
designed to adapt on their own to standards other than the Unicode standard 
itself.


2012/11/17 Buck Golemon <b...@yelp.com>
On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <d...@ewellic.org> wrote:
Buck Golemon wrote:
Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
to map it to the equally non-semantic U+0081?

This would allow systems that follow the HTML5 standard and use cp1252
in place of latin1 to continue to be binary-faithful and reversible.

This isn't quite as black-and-white as the question about Latin-1. If you are 
targeting HTML5, you are probably safe in treating an incoming 0x81 (for 
example) as either U+0081 or U+FFFD, or throwing some kind of error.

Why do you make this conditional on targeting HTML5?

To me, replacement and erroring are both out, because they mean the system 
loses data or fails completely where it used to succeed.
Currently there's no reasonable way for me to implement the U+0081 option other 
than inventing a new "cp1252+latin1" codec, which seems undesirable.
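
For what it's worth, here is roughly the shape of that fallback as a sketch 
(Python assumed; the error-handler name "latin1_fallback" is invented for 
illustration, not an existing codec):

    import codecs

    # When cp1252 has no mapping for a byte (0x81, 0x8D, 0x8F, 0x90, 0x9D),
    # fall back to the numerically equal code point; do the reverse when
    # encoding, so the transform stays binary-faithful and reversible.
    def latin1_fallback(exc):
        if isinstance(exc, UnicodeDecodeError):
            return exc.object[exc.start:exc.end].decode('latin-1'), exc.end
        if isinstance(exc, UnicodeEncodeError):
            return exc.object[exc.start:exc.end].encode('latin-1'), exc.end
        raise exc

    codecs.register_error('latin1_fallback', latin1_fallback)

    data = b'caf\xe9 \x81 \x93quoted\x94'
    text = data.decode('cp1252', errors='latin1_fallback')
    assert text == 'caf\xe9 \x81 \u201cquoted\u201d'
    assert text.encode('cp1252', errors='latin1_fallback') == data

Whether registering an error handler like this is any less undesirable than a 
whole new codec is debatable, but it does keep round-trips lossless.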

HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer 
matters what the byte is in 8859-1.

I feel like you skipped a step. The byte is 0x81, full stop. I agree that it 
doesn't matter how it's defined in latin1 (also, it's not defined in latin1).
The section of the Unicode Standard that says control codes are equal to their 
Unicode code points doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it 
goes out of its way to talk about "8-bit" control codes.
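
For reference, a quick check of my own against Python's built-in tables (not 
something the standard spells out): the bytes cp1252 leaves undefined are 
exactly five of those C1 control positions.

    undefined = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode('cp1252')
        except UnicodeDecodeError:
            undefined.append(hex(b))
    print(undefined)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']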

