Satisfactory answers, thank you very much.

Going back to doing more research. (Silence does not imply abandoning the C1 Control Pictures project; just a lot to synthesize.)

Regarding the three code points U+0080, U+0081, and U+0099: the fact that Unicode defers mostly to ISO 6429 and to other standards that predate it (e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not particularly urgent for those code points to get Unicode names. I also do not find that their lack of definition precludes pictorial representations. For the current U+2400 block, the Standard says: "The diagonal lettering glyphs are only exemplary; alternate representations may be, and often are used in the visible display of control codes" (see also Section 22.7).

I am now in possession of copies of ANSI X3.32-1973 and ECMA-17:1968 (the latter is available on ECMA's website). It is worth pointing out that the Transmission Controls and Format Effectors had not yet been standardized at the time of ECMA-17:1968, but the symbols are the same nonetheless; ANSI X3.32-1973 has the standardized control names for those characters.

Sean

On 10/6/2015 6:57 AM, Philippe Verdy wrote:

2015-10-06 14:24 GMT+02:00 Sean Leonard <lists+unic...@seantek.com>:

        2. The Unicode code charts are (deliberately) vague about
        U+0080, U+0081,
        and U+0099. All other C1 control codes have aliases to the ISO
        6429
        set of control functions, but in ISO 6429, those three control
        codes don't
        have any assigned functions (or names).


    On 10/5/2015 3:57 PM, Philippe Verdy wrote:

        Also the aliases for C1 controls were formally registered in
        1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F
        for ISO 6429.


    If I may, I would appreciate another history lesson:
    In ISO 2022 / 6429 land, it is apparent that the C1 controls are
    mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary
    depending on what is loaded into the C1 register, but overall, it
    just seems like saving one byte.

    Why was C1 invented in the first place?


Look at the history of EBCDIC and its adaptation to and conversion with ASCII-compatible encodings: round-trip conversion was needed (using only a simple reordering of byte values, with no duplicates). EBCDIC used many controls that were not part of C0, and these were kept in the C1 set. Ignore the 7-bit compatibility encoding using pairs: it was only needed for ISO 2022, and ISO 6429 defines a profile where those longer sequences are not needed and are even forbidden in 8-bit contexts or in contexts where aliases are undesirable and invalidated, such as security environments.

By that reasoning, I would conclude that assigning characters in the G1 set was also a duplication, because every G1 position is reachable with a C0 "shifting" control plus a position in the G0 set. In that case ISO 8859-1 or Windows-1252 would also have been unneeded duplication, and we would live today in a 7-bit-only world.

C1 controls have their own identity. The 7-bit encoding using ESC is just a hack to make them fit in 7 bits, and it only works where the ESC control is assumed to play this function according to ISO 2022, ISO 6429, or other similar old 7-bit protocols such as Videotext (which was widely used in France with the free "Minitel" terminal, long before the introduction of the Internet to the general public around 1992-1995).
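The correspondence is mechanical, which is why it reads as "just saving one byte": a C1 control at 0x80 + n corresponds to ESC followed by 0x40 + n (columns 4/0 through 5/15, i.e. @ through _). A minimal Python sketch of that mapping, purely for illustration and not any particular implementation:

```python
# Illustrative sketch of the ISO 2022 / ISO 6429 correspondence between
# 8-bit C1 controls (0x80..0x9F) and their 7-bit ESC Fe forms (ESC 0x40..0x5F).
ESC = 0x1B

def c1_to_esc(byte: int) -> bytes:
    """Map an 8-bit C1 control byte to its 7-bit ESC Fe sequence."""
    assert 0x80 <= byte <= 0x9F
    return bytes([ESC, byte - 0x40])

def esc_to_c1(seq: bytes) -> int:
    """Map an ESC Fe sequence (ESC 4/0..5/15) back to the 8-bit C1 byte."""
    assert len(seq) == 2 and seq[0] == ESC and 0x40 <= seq[1] <= 0x5F
    return seq[1] + 0x40

assert c1_to_esc(0x9B) == b"\x1b["      # CSI: 0x9B <-> ESC [
assert esc_to_c1(b"\x1b[") == 0x9B
```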

Today Videotext is definitely dead: the old call numbers for that slow service are now defunct, and the Minitel terminals have been recycled as waste; they stopped being distributed and were replaced by applications on PCs connected to the Internet. All the old services are now directly on the Internet, and none of them use 7-bit encodings for their HTML pages or their mobile applications. France has also definitively abandoned its old French version of ISO 646; there are no longer any printers supporting versions of ISO 646 other than ASCII, though they still support various 8-bit encodings.

7-bit encodings are things of the past: they were only justified at a time when communication links were slow and generated lots of transmission errors, and the only implemented mechanism to check for them was a single parity bit per character. Today we transmit long datagrams and prefer check codes over the whole datagram (such as a CRC, or error-correcting codes). 8-bit encodings are much easier and faster to process for transmitting not just text but also binary data.
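To make the comparison concrete, here is a small illustrative Python sketch of the two error-detection styles mentioned above: one parity bit per 7-bit character versus a single check code (a CRC) over the whole datagram. It is a sketch of the idea only, not of any real link-layer protocol:

```python
import zlib

def even_parity_bit(char7: int) -> int:
    """Even-parity bit appended to a 7-bit character on old serial links."""
    return bin(char7 & 0x7F).count("1") % 2

message = b"HELLO, WORLD"

# Per-character check: one extra bit for every 7 bits of payload.
parity_bits = [even_parity_bit(b) for b in message]

# Whole-datagram check: a single 32-bit CRC for the entire message.
crc = zlib.crc32(message)

print(parity_bits)   # one parity bit per character
print(hex(crc))      # one check value covering the whole datagram
```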

Let's forget the 7-bit world for good. We have also abandoned the old UTF-7 in Unicode! I have not seen it used anywhere except in a few old emails sent at the end of the '90s, because many mail servers were still not 8-bit clean and silently transformed non-ASCII bytes in unpredictable ways or with unspecified encodings, or just silently dropped the high bit, assuming it was just a parity bit. At that time, emails were not sent with SMTP but with the old UUCP protocol, and could take weeks to be delivered to the final recipient, as there was still no global routing infrastructure and many hops were necessary via non-permanent modem links. My opinion of UTF-7 is that it was just a temporary and experimental solution to help system admins and developers adopt the new UCS, including in their old 7-bit environments.
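(As an aside, Python still ships a UTF-7 codec, so the property that made UTF-7 attractive for those non-8-bit-clean mail paths is easy to demonstrate; the snippet below is illustrative only:)

```python
# UTF-7 keeps every byte below 0x80, so a relay that strips the high bit
# (treating it as parity) cannot corrupt the encoded form.
text = "café"
encoded = text.encode("utf-7")
print(encoded)                          # b'caf+AOk-' -- all 7-bit ASCII bytes
assert all(b < 0x80 for b in encoded)
assert encoded.decode("utf-7") == text
```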


On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote:
On 10/6/2015 5:24 AM, Sean Leonard wrote:
And, why did Unicode deem it necessary to replicate the C1 block at 0x80-0x9F, when all of the control characters (codes) were equally reachable via ESC 4/0 - 5/15? I understand why it is desirable to align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other non-ISO-standardized 8-bit encodings got this much right: duplicating control codes is basically a waste of very precious character code real estate.

Because Unicode aligns with ISO 8859-1, so that transcoding from that was a simple zero-fill to 16 bits.
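That "zero-fill" is literally the whole transcoding step; an illustrative check in Python, assuming nothing beyond the standard codecs:

```python
# Every ISO 8859-1 byte value equals its Unicode code point, so transcoding
# to UTF-16 is just zero-extending each byte to 16 bits.
data = bytes(range(256))                 # all 256 Latin-1 byte values
text = data.decode("latin-1")
assert all(ord(ch) == b for ch, b in zip(text, data))
```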

8859-1 was the most widely used single byte (full 8-bit) ISO standard at the time, and making that transition easy was beneficial, both practically and politically.

Vendor standards all disagreed on the upper range, and it would not have been feasible to single out any of them. Nobody wanted to follow the IBM code page 437 (then still the most widely used single byte vendor standard).


Note that by "then" I refer to dates earlier than the dates of the final drafts, because many of those decisions date back to earlier periods when the drafts were first developed. Also, the overloading of 0x80-0xFF by Windows did not happen all at once; earlier versions had left much of that space open, but then people realized that as long as you were still limited to 8 bits, throwing away 32 codes was an issue.

Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now) don't matter, so being "clean" didn't cost much. (Note that even for UTF-8, there's no special benefit to a value being inside that second range of 128 codes.)
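The parenthetical remark is easy to verify: in UTF-8, code points in the C1 range take two bytes, exactly like every other code point up to U+07FF, so reusing those 32 slots would have bought nothing. A quick illustrative check:

```python
# U+0080..U+009F (the C1 range) and U+00A0..U+07FF all take exactly two
# bytes in UTF-8; the C1 slots get no shorter an encoding than their neighbors.
assert len("\u0080".encode("utf-8")) == 2    # C1 control U+0080
assert len("\u00a0".encode("utf-8")) == 2    # NO-BREAK SPACE, just outside C1
assert len("\u07ff".encode("utf-8")) == 2    # last two-byte code point
```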

Finally, even if the range had not been dedicated to C1, the 32 codes would have had to be given space, because the translation into ESC sequences is not universal; so, when transcoding data, you needed a way to retain the difference between the raw code and the ESC sequence, or your round trip would not be lossless.
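A small illustrative sketch of that round-trip requirement: an input containing a raw C1 byte and an input containing the equivalent ESC sequence must remain distinct after transcoding to Unicode and back:

```python
# Two 8-bit inputs that an ISO 6429 terminal would treat the same way,
# but which a lossless transcoder must keep distinct.
raw_csi = b"\x9bm"        # raw C1 CSI, then 'm'
esc_csi = b"\x1b[m"       # ESC [ (7-bit form of CSI), then 'm'

as_unicode_raw = raw_csi.decode("latin-1")   # '\u009b' 'm'
as_unicode_esc = esc_csi.decode("latin-1")   # '\u001b' '[' 'm'

# Because U+0080..U+009F exist as real code points, both forms survive the
# round trip unchanged instead of collapsing into one.
assert as_unicode_raw != as_unicode_esc
assert as_unicode_raw.encode("latin-1") == raw_csi
assert as_unicode_esc.encode("latin-1") == esc_csi
```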

A./
