On 6/21/2010 2:46 PM, P.J. Eby wrote:

> This ignores the existence of use cases where what you have is text
> that can't be properly encoded in unicode.

I think it depends on what you mean by 'properly'. I will try to explain with English examples.

1. Unicode represents a finite set of characters, symbols, and a few control or markup operators. The potential set is unbounded, so Unicode includes private use areas. I include use of those areas in 'properly'. I rather suspect that the statement above does not, since any byte or short byte sequence that does not translate can instead be mapped into a private use area.
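To make the private-use-area idea concrete, here is a minimal sketch. The base offset 0xE000 and the byte-per-codepoint scheme are my own illustrative choices, not any standard mapping; the point is only that undecodable bytes can be parked in the BMP private use area and recovered exactly:

```python
# Park non-ASCII bytes from an unknown encoding in the BMP private
# use area (U+E000..U+F8FF), one codepoint per byte, so the original
# byte string can be recovered exactly.  The 0xE000 base is an
# arbitrary illustrative choice.
PUA_BASE = 0xE000

def to_pua(data: bytes) -> str:
    # ASCII bytes decode normally; anything else goes to the PUA.
    return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in data)

def from_pua(text: str) -> bytes:
    # Reverse the mapping: PUA codepoints become their original bytes.
    return bytes(ord(c) if ord(c) < 0x80 else ord(c) - PUA_BASE for c in text)

raw = b"plain \x9b\xff text"
assert from_pua(to_pua(raw)) == raw  # lossless round trip
```

A real scheme would also have to publish the mapping (PUA assignments are private by definition), which is exactly why such text needs cooperating software at both ends.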

2. Unicode disclaims direct representation of font and style information, leaving that to markup either in or out of the text stream. (It made an exception for the Japanese narrow and wide ASCII chars, which I consider essentially duplicate font variations of the normal ASCII codes.) HTML uses both in-band and out-of-band (CSS) markup. Stripping markup information is a loss of information; if one wants it, one must keep it in one form or another.

I believe that some early editors like WordStar used high-bit-set bytes for bold, underline, and italic on and off. Assuming I have the example right, can WordStar text be 'properly encoded in unicode'? If one insists that 'properly' means replacing each formatting byte with a single defined char in the Basic Multilingual Plane, then 'no'. If one allows replacement by <bold>, </bold>, and so on, then 'yes'.
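A toy decoder for the scenario just described. The toggle byte values here are invented for illustration (real WordStar codes may differ, as the "assuming I have the example right" caveat already allows); the point is that each formatting byte can be rewritten as a paired tag rather than a single codepoint:

```python
# Hypothetical high-bit toggle bytes (NOT real WordStar codes),
# rewritten as paired markup tags on the way to Unicode text.
TOGGLES = {0x82: "bold", 0x93: "underline", 0x99: "italic"}

def decode_toggles(data: bytes) -> str:
    open_tags = set()
    out = []
    for b in data:
        if b in TOGGLES:
            name = TOGGLES[b]
            if name in open_tags:          # second occurrence closes the tag
                out.append(f"</{name}>")
                open_tags.discard(name)
            else:                          # first occurrence opens it
                out.append(f"<{name}>")
                open_tags.add(name)
        else:
            out.append(chr(b))             # plain ASCII passes through
    return "".join(out)

print(decode_toggles(b"a \x82bold\x82 word"))  # -> a <bold>bold</bold> word
```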

3. Unicode disclaims direct representation of glyphic variants (though again, exceptions were made for Asian acceptance). For example, in English, mechanically printed 'a' and 'g' are different from manually printed 'a' and 'g'. Representing both by the same codepoint, in itself, loses information. One who wishes to preserve the distinction must instead use a font tag or perhaps a <handprinted> tag. Similarly, older English had a significantly different glyph for 's' (the 'long s'), which looks more like a modern 'f'.

If IBM's EBCDIC had had codes for these glyph variants, IBM might have insisted that Unicode also have them, so that char-for-char round-tripping would be possible. It does not, and Unicode does not. (WordStar and the other 1980s editor publishers were mostly defunct or weak and not in a position to make such demands.)

If one wants to write on the history of glyph evolution, say of Latin chars, one must either number the variants 'e-0', 'e-1', etc., or resort to a private use area. In either case, proprietary software would be needed to actually print the variations along with other text.

> I know, it's a hard thing to wrap one's head around, since on the
> surface it sounds like unicode is the programmer's savior.
> Unfortunately, real-world text data exists which cannot be safely
> roundtripped to unicode,

I do not believe that. Digital information can always be recoded one way or another. As it is, the rules were bent for Japanese, in a way that they were not for English, to aid round-tripping of the major public encodings. I can, however, believe that there are private encodings for which round-tripping is more difficult. But there are also such difficulties for old proprietary and even private English encodings.
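Python 3 itself demonstrates that "undecodable" bytes can be recoded losslessly: the surrogateescape error handler of PEP 383 smuggles each stray byte through str as a lone surrogate, so any byte string round-trips through text and back:

```python
# PEP 383: the surrogateescape error handler maps each undecodable
# byte to a lone surrogate (U+DC80..U+DCFF), making decode/encode a
# lossless round trip for arbitrary byte strings.
raw = b"ok \xff\xfe bytes"
text = raw.decode("ascii", errors="surrogateescape")
back = text.encode("ascii", errors="surrogateescape")
assert back == raw  # every byte recovered exactly
```

The resulting str is not valid Unicode for interchange (lone surrogates cannot be encoded to UTF-8 normally), which is the trade-off: the data is preserved, but it must be re-escaped on the way out.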


--
Terry Jan Reedy

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev