Satisfactory answers, thank you very much.

Going back to doing more research. (Silence does not imply abandoning the C1 Control Pictures project; just a lot to synthesize.)

Regarding the three code points U+0080, U+0081, and U+0099: the fact that Unicode defers mostly to ISO 6429 and to other standards that predate it (e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not particularly urgent for those code points to get Unicode names. I also do not find that their lack of definition precludes pictorial representations. For the current U+2400 block, the Standard says: "The diagonal lettering glyphs are only exemplary; alternate representations may be, and often are used in the visible display of control codes" (see also Section 22.7).

I am now in possession of copies of ANSI X3.32-1973 and ECMA-17:1968 (the latter is available on ECMA's website). It is worth pointing out that the Transmission Controls and Format Effectors had not yet been standardized at the time of ECMA-17:1968, but the symbols are the same nonetheless; ANSI X3.32-1973 has the standardized control names for those characters.

Sean

On 10/6/2015 6:57 AM, Philippe Verdy wrote:

2015-10-06 14:24 GMT+02:00 Sean Leonard <lists+unic...@seantek.com>:

        2. The Unicode code charts are (deliberately) vague about
        U+0080, U+0081,
        and U+0099. All other C1 control codes have aliases to the ISO
        6429
        set of control functions, but in ISO 6429, those three control
        codes don't
        have any assigned functions (or names).


    On 10/5/2015 3:57 PM, Philippe Verdy wrote:

        Also the aliases for C1 controls were formally registered in
        1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F
        for ISO 6429.


    If I may, I would appreciate another history lesson:
    In ISO 2022 / 6429 land, it is apparent that the C1 controls are
    mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary
    depending on what is loaded into the C1 register, but overall, it
    just seems like saving one byte.

    Why was C1 invented in the first place?


Look at the history of EBCDIC and its adaptation to and conversion with ASCII-compatible encodings: round-trip conversion was needed (using only a simple reordering of byte values, with no duplicates). EBCDIC used many controls that were not part of C0, and these were kept in the C1 set. Ignore the 7-bit compatibility encoding using pairs: it was only needed for ISO 2022, and ISO 6429 defines a profile where those longer sequences are not needed and are even forbidden in 8-bit contexts or in contexts where aliases are undesirable and invalidated, such as security environments.

By that reasoning, I would conclude that assigning characters in the G1 set was also a duplication, because every G1 position is reachable with a C0 "shifting" control plus a position in the G0 set. In that case ISO 8859-1 or Windows-1252 would also have been unneeded duplication, and we would live today in a 7-bit-only world.

C1 controls have their own identity. The 7-bit encoding using ESC is just a hack to make them fit in 7 bits, and it only works where the ESC control is assumed to play this function according to ISO 2022, ISO 6429, or other similar old 7-bit protocols such as Videotext (which was widely used in France with the free "Minitel" terminal, long before the introduction of the Internet to the general public around 1992-1995).
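The correspondence is mechanical, which is why it reads as "just saving one byte": a C1 control at 0x80 + n corresponds to ESC followed by 0x40 + n (columns 4/0 through 5/15, i.e. @ through _). A minimal Python sketch of that mapping, purely for illustration and not any particular implementation:

```python
# Illustrative sketch of the ISO 2022 / ISO 6429 correspondence between
# 8-bit C1 controls (0x80..0x9F) and their 7-bit ESC Fe forms (ESC 0x40..0x5F).
ESC = 0x1B

def c1_to_esc(byte: int) -> bytes:
    """Map an 8-bit C1 control byte to its 7-bit ESC Fe sequence."""
    assert 0x80 <= byte <= 0x9F
    return bytes([ESC, byte - 0x40])

def esc_to_c1(seq: bytes) -> int:
    """Map an ESC Fe sequence (ESC 4/0..5/15) back to the 8-bit C1 byte."""
    assert len(seq) == 2 and seq[0] == ESC and 0x40 <= seq[1] <= 0x5F
    return seq[1] + 0x40

assert c1_to_esc(0x9B) == b"\x1b["      # CSI: 0x9B <-> ESC [
assert esc_to_c1(b"\x1b[") == 0x9B
```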

Today Videotext is definitely dead: the old call numbers for that slow service are now defunct, and the Minitel terminals have been recycled as waste; they stopped being distributed and were replaced by applications on PCs connected to the Internet. All the old services are now directly on the Internet, and none of them use 7-bit encodings for their HTML pages or their mobile applications. France has also definitively abandoned its old French version of ISO 646; there are no longer any printers supporting versions of ISO 646 other than ASCII, though they still support various 8-bit encodings.

7-bit encodings are things of the past: they were only justified at a time when communication links were slow and generated lots of transmission errors, and the only implemented mechanism to check for them was a single parity bit per character. Today we transmit long datagrams and prefer check codes over the whole datagram (such as a CRC, or error-correcting codes). 8-bit encodings are much easier and faster to process for transmitting not just text but also binary data.
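To make the comparison concrete, here is a small illustrative Python sketch of the two error-detection styles mentioned above: one parity bit per 7-bit character versus a single check code (a CRC) over the whole datagram. It is a sketch of the idea only, not of any real link-layer protocol:

```python
import zlib

def even_parity_bit(char7: int) -> int:
    """Even-parity bit appended to a 7-bit character on old serial links."""
    return bin(char7 & 0x7F).count("1") % 2

message = b"HELLO, WORLD"

# Per-character check: one extra bit for every 7 bits of payload.
parity_bits = [even_parity_bit(b) for b in message]

# Whole-datagram check: a single 32-bit CRC for the entire message.
crc = zlib.crc32(message)

print(parity_bits)   # one parity bit per character
print(hex(crc))      # one check value covering the whole datagram
```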

Let's forget the 7-bit world for good. We have also abandoned the old UTF-7 in Unicode! I have not seen it used anywhere except in a few old emails sent at the end of the '90s, because many mail servers were still not 8-bit clean and silently transformed non-ASCII bytes in unpredictable ways or with unspecified encodings, or just silently dropped the high bit, assuming it was just a parity bit. At that time, emails were not sent with SMTP but with the old UUCP protocol, and could take weeks to be delivered to the final recipient, as there was still no global routing infrastructure and many hops were necessary via non-permanent modem links. My opinion of UTF-7 is that it was just a temporary and experimental solution to help system admins and developers adopt the new UCS, including in their old 7-bit environments.
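(As an aside, Python still ships a UTF-7 codec, so the property that made UTF-7 attractive for those non-8-bit-clean mail paths is easy to demonstrate; the snippet below is illustrative only:)

```python
# UTF-7 keeps every byte below 0x80, so a relay that strips the high bit
# (treating it as parity) cannot corrupt the encoded form.
text = "café"
encoded = text.encode("utf-7")
print(encoded)                          # b'caf+AOk-' -- all 7-bit ASCII bytes
assert all(b < 0x80 for b in encoded)
assert encoded.decode("utf-7") == text
```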


On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote:
On 10/6/2015 5:24 AM, Sean Leonard wrote:
And, why did Unicode deem it necessary to replicate the C1 block at 0x80-0x9F, when all of the control characters (codes) were equally reachable via ESC 4/0 - 5/15? I understand why it is desirable to align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other non-ISO-standardized 8-bit encodings got this much right: duplicating control codes is basically a waste of very precious character code real estate.

Because Unicode aligns with ISO 8859-1, so that transcoding from that was a simple zero-fill to 16 bits.
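That "zero-fill" is literally the whole transcoding step; an illustrative check in Python, assuming nothing beyond the standard codecs:

```python
# Every ISO 8859-1 byte value equals its Unicode code point, so transcoding
# to UTF-16 is just zero-extending each byte to 16 bits.
data = bytes(range(256))                 # all 256 Latin-1 byte values
text = data.decode("latin-1")
assert all(ord(ch) == b for ch, b in zip(text, data))
```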

8859-1 was the most widely used single byte (full 8-bit) ISO standard at the time, and making that transition easy was beneficial, both practically and politically.

Vendor standards all disagreed on the upper range, and it would not have been feasible to single out any of them. Nobody wanted to follow the IBM code page 437 (then still the most widely used single byte vendor standard).


Note that by "then" I refer to dates earlier than the dates of the final drafts, because many of those decisions date back to earlier periods when the drafts were first developed. Also, the overloading of 0x80-0xFF by Windows did not happen all at once; earlier versions had left much of that space open, but then people realized that as long as you were still limited to 8 bits, throwing away 32 codes was an issue.

Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now) don't matter, so being "clean" didn't cost much. (Note that even for UTF-8, there's no special benefit to a value being inside that second range of 128 codes.)
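The parenthetical remark is easy to verify: in UTF-8, code points in the C1 range take two bytes, exactly like every other code point up to U+07FF, so reusing those 32 slots would have bought nothing. A quick illustrative check:

```python
# U+0080..U+009F (the C1 range) and U+00A0..U+07FF all take exactly two
# bytes in UTF-8; the C1 slots get no shorter an encoding than their neighbors.
assert len("\u0080".encode("utf-8")) == 2    # C1 control U+0080
assert len("\u00a0".encode("utf-8")) == 2    # NO-BREAK SPACE, just outside C1
assert len("\u07ff".encode("utf-8")) == 2    # last two-byte code point
```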

Finally, even if the range had not been dedicated to C1, the 32 codes would have had to be given space, because the translation into ESC sequences is not universal; so, when transcoding data, you needed a way to retain the difference between the raw code and the ESC sequence, or your round trip would not be lossless.
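A small illustrative sketch of that round-trip requirement: an input containing a raw C1 byte and an input containing the equivalent ESC sequence must remain distinct after transcoding to Unicode and back:

```python
# Two 8-bit inputs that an ISO 6429 terminal would treat the same way,
# but which a lossless transcoder must keep distinct.
raw_csi = b"\x9bm"        # raw C1 CSI, then 'm'
esc_csi = b"\x1b[m"       # ESC [ (7-bit form of CSI), then 'm'

as_unicode_raw = raw_csi.decode("latin-1")   # '\u009b' 'm'
as_unicode_esc = esc_csi.decode("latin-1")   # '\u001b' '[' 'm'

# Because U+0080..U+009F exist as real code points, both forms survive the
# round trip unchanged instead of collapsing into one.
assert as_unicode_raw != as_unicode_esc
assert as_unicode_raw.encode("latin-1") == raw_csi
assert as_unicode_esc.encode("latin-1") == esc_csi
```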

A./
