On 6/21/2010 2:46 PM, P.J. Eby wrote:

> This ignores the existence of use cases where what you have is text
> that can't be properly encoded in unicode.

I think it depends on what you mean by 'properly'. I will try to explain with English examples.

1. Unicode represents a finite set of characters, symbols, and a few control or markup operators. The potential set is unbounded, so Unicode includes private use areas. I include use of those areas in 'properly'. I rather suspect that the statement above does not, since any byte or short byte sequence that does not translate can instead be mapped into a private use area.
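To make the private-use-area idea concrete, here is a minimal sketch. The base offset 0xE000 and the byte-per-codepoint scheme are my own illustrative choices, not any standard mapping; the point is only that undecodable bytes can be parked in the BMP private use area and recovered exactly:

```python
# Park non-ASCII bytes from an unknown encoding in the BMP private
# use area (U+E000..U+F8FF), one codepoint per byte, so the original
# byte string can be recovered exactly.  The 0xE000 base is an
# arbitrary illustrative choice.
PUA_BASE = 0xE000

def to_pua(data: bytes) -> str:
    # ASCII bytes decode normally; anything else goes to the PUA.
    return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in data)

def from_pua(text: str) -> bytes:
    # Reverse the mapping: PUA codepoints become their original bytes.
    return bytes(ord(c) if ord(c) < 0x80 else ord(c) - PUA_BASE for c in text)

raw = b"plain \x9b\xff text"
assert from_pua(to_pua(raw)) == raw  # lossless round trip
```

A real scheme would also have to publish the mapping (PUA assignments are private by definition), which is exactly why such text needs cooperating software at both ends.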

2. Unicode disclaims direct representation of font and style information, leaving that to markup either in or out of the text stream. (It made an exception for the Japanese narrow and wide ASCII chars, which I consider essentially duplicate font variations of the normal ASCII codes.) HTML uses both in-band and out-of-band (CSS) markup. Stripping markup information is a loss of information; if one wants it, one must keep it in one form or another.

I believe that some early editors like WordStar used high-bit-set bytes for bold, underline, and italic on and off. Assuming I have the example right, can WordStar text be 'properly encoded in unicode'? If one insists that 'properly' means replacing each formatting byte with a single defined char in the Basic Multilingual Plane, then 'no'. If one allows replacement by <bold>, </bold>, and so on, then 'yes'.
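A toy decoder for the scenario just described. The toggle byte values here are invented for illustration (real WordStar codes may differ, as the "assuming I have the example right" caveat already allows); the point is that each formatting byte can be rewritten as a paired tag rather than a single codepoint:

```python
# Hypothetical high-bit toggle bytes (NOT real WordStar codes),
# rewritten as paired markup tags on the way to Unicode text.
TOGGLES = {0x82: "bold", 0x93: "underline", 0x99: "italic"}

def decode_toggles(data: bytes) -> str:
    open_tags = set()
    out = []
    for b in data:
        if b in TOGGLES:
            name = TOGGLES[b]
            if name in open_tags:          # second occurrence closes the tag
                out.append(f"</{name}>")
                open_tags.discard(name)
            else:                          # first occurrence opens it
                out.append(f"<{name}>")
                open_tags.add(name)
        else:
            out.append(chr(b))             # plain ASCII passes through
    return "".join(out)

print(decode_toggles(b"a \x82bold\x82 word"))  # -> a <bold>bold</bold> word
```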

3. Unicode disclaims direct representation of glyphic variants (though again, exceptions were made for Asian acceptance). For example, in English, mechanically printed 'a' and 'g' are different from manually printed 'a' and 'g'. Representing both by the same codepoint, in itself, loses information. One who wishes to preserve the distinction must instead use a font tag or perhaps a <handprinted> tag. Similarly, older English had a significantly different glyph for 's' (the 'long s'), which looks more like a modern 'f'.

If IBM's EBCDIC had had codes for these glyph variants, IBM might have insisted that Unicode also have them, so that char-for-char round-tripping would be possible. It does not, and Unicode does not. (WordStar and the other 1980s editor publishers were mostly defunct or weak and not in a position to make such demands.)

If one wants to write on the history of glyph evolution, say of Latin chars, one must either number the variants 'e-0', 'e-1', etc., or resort to a private use area. In either case, proprietary software would be needed to actually print the variations along with other text.

> I know, it's a hard thing to wrap one's head around, since on the
> surface it sounds like unicode is the programmer's savior.
> Unfortunately, real-world text data exists which cannot be safely
> roundtripped to unicode,

I do not believe that. Digital information can always be recoded one way or another. As it is, the rules were bent for Japanese, in a way that they were not for English, to aid round-tripping of the major public encodings. I can, however, believe that there are private encodings for which round-tripping is more difficult. But there are also such difficulties for old proprietary and even private English encodings.
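Python 3 itself demonstrates that "undecodable" bytes can be recoded losslessly: the surrogateescape error handler of PEP 383 smuggles each stray byte through str as a lone surrogate, so any byte string round-trips through text and back:

```python
# PEP 383: the surrogateescape error handler maps each undecodable
# byte to a lone surrogate (U+DC80..U+DCFF), making decode/encode a
# lossless round trip for arbitrary byte strings.
raw = b"ok \xff\xfe bytes"
text = raw.decode("ascii", errors="surrogateescape")
back = text.encode("ascii", errors="surrogateescape")
assert back == raw  # every byte recovered exactly
```

The resulting str is not valid Unicode for interchange (lone surrogates cannot be encoded to UTF-8 normally), which is the trade-off: the data is preserved, but it must be re-escaped on the way out.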


--
Terry Jan Reedy

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev