On 10/6/2015 5:24 AM, Sean Leonard wrote:
And, why did Unicode deem it necessary to replicate the C1 block at 0x80-0x9F, when all of the control characters (codes) were equally reachable via ESC 4/0 - 5/15? I understand why it is desirable to align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other non-ISO-standardized 8-bit encodings got this much right: duplicating control codes is basically a waste of very precious character code real estate

Because Unicode was aligned with ISO 8859-1, so that transcoding from it was a simple zero-fill to 16 bits.
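To see what that zero-fill amounts to in practice, here is a small Python sketch (illustrative only; the helper name latin1_to_utf16_code_units is my own): each ISO 8859-1 byte simply becomes the identically numbered 16-bit code unit.

    # Each ISO 8859-1 byte 0xNN maps to the UTF-16 code unit 0x00NN; the high
    # byte is just zero-filled, no lookup table needed.
    def latin1_to_utf16_code_units(data: bytes) -> list[int]:
        return [b for b in data]            # 0x00..0xFF -> U+0000..U+00FF

    sample = bytes([0x41, 0xE9, 0x85])      # 'A', e-acute, and the C1 control NEL
    assert latin1_to_utf16_code_units(sample) == [0x41, 0xE9, 0x85]
    assert sample.decode('latin-1') == '\u0041\u00e9\u0085'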

ISO 8859-1 was the most widely used single-byte (full 8-bit) ISO standard at the time, and making that transition easy was beneficial, both practically and politically.

Vendor standards all disagreed on the upper range, and it would not have been feasible to single out any of them. Nobody wanted to follow IBM code page 437 (then still the most widely used single-byte vendor standard).

Note that by "then" I refer to dates earlier than those of the final drafts, because many of those decisions date back to earlier periods when the drafts were first developed. Also, the overloading of 0x80-0xFF by Windows did not happen all at once: earlier versions had left much of that space open, but then people realized that, as long as you were still limited to 8 bits, throwing away 32 codes was an issue.

Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now) don't matter, so being "clean" didn't cost much. (Note that even for UTF-8, there's no special benefit to a value being inside that second range of 128 codes.)
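As a quick illustration of that parenthetical (again just a Python sketch of my own): code points in U+0080..U+00FF take two bytes in UTF-8, exactly like everything else up to U+07FF, so sitting inside that second range of 128 codes buys nothing.

    # C1 code points need two UTF-8 bytes, the same as any other code point
    # in U+0080..U+07FF; the range confers no encoding advantage.
    for cp in (0x0085, 0x00FF, 0x0100, 0x07FF):
        encoded = chr(cp).encode('utf-8')
        print(f'U+{cp:04X} -> {encoded.hex()} ({len(encoded)} bytes)')
    # U+0085 -> c285 (2 bytes)
    # U+00FF -> c3bf (2 bytes)
    # U+0100 -> c480 (2 bytes)
    # U+07FF -> dfbf (2 bytes)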

Finally, even if the range had not been dedicated to C1, the 32 codes would have had to be given space anyway, because the translation into ESC sequences is not universal; in transcoding data, you needed a way to retain the difference between the raw code and the ESC sequence, or your round trip would not be lossless.
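To make the round-trip point concrete, here is a deliberately simplified Python sketch (my own hypothetical converter, not any real transcoder): it keeps a raw C1 byte and the corresponding ESC sequence as distinct code point sequences, so converting back to 8-bit reproduces exactly the form that was in the input. If ESC 4/5 had instead been folded into U+0085 on the way in, the reverse conversion could not know which spelling to regenerate.

    ESC = 0x1B

    # Raw bytes 0x80..0x9F land on U+0080..U+009F, while an ESC Fe sequence
    # stays as U+001B plus its final byte, so the two spellings of the same
    # control remain distinguishable in Unicode.
    def to_unicode(data: bytes) -> str:
        return ''.join(chr(b) for b in data)

    def to_bytes(text: str) -> bytes:
        return bytes(ord(c) for c in text)   # every code point here is < 0x100

    raw_nel, esc_nel = bytes([0x85]), bytes([ESC, 0x45])   # two spellings of NEL
    assert to_bytes(to_unicode(raw_nel)) == raw_nel        # lossless round trip
    assert to_bytes(to_unicode(esc_nel)) == esc_nel
    assert to_unicode(raw_nel) != to_unicode(esc_nel)      # forms stay distinct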

A./
