John Nagle wrote:
On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
This still seems odd to me. I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte stream
to some kind of escaped ASCII before it can be written back out.
Here's what's really going on.
Unicode strings within Python have to be indexable. So the internal
representation of Unicode is fixed-width, with (usually) two bytes for
each character, so they work like arrays.
UTF-8 is a stream format for Unicode. It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each. The format is
described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins. So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.
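To make the byte counts above concrete, here is a small sketch. It assumes Python 3 (where str is the Unicode type and bytes is the byte string), and the sample characters are picked purely for illustration; none of them come from the thread.

```python
# Sketch only: assumes Python 3. The sample characters are illustrative.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, widths):
    print("U+%04X -> %d byte(s)" % (ord(ch), n))

# A code point in 128..255 is no longer a single byte in UTF-8.
latin1_width = len("\u00c8".encode("utf-8"))
print("U+00C8 -> %d byte(s)" % latin1_width)
```

Because the widths differ, the byte offset of the Nth character depends on every character before it, which is the indexing problem described above.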
Not entirely correct. The advantage of UTF-8 is that although different
codepoints might be encoded into different numbers of bytes it's easy to
tell whether a particular byte is the first in its sequence, so you
don't have to parse from the start of the file. It is true, however, that it
can't be easily indexed.
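A rough sketch of that point, assuming Python 3 and well-formed UTF-8 input: continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset you can back up to the start of the current character. The helper name char_start is made up for this example.

```python
# Sketch only: assumes Python 3 and valid UTF-8. char_start is a
# hypothetical helper, not a standard library function.
def char_start(data, pos):
    """Return the offset of the first byte of the character containing pos."""
    # Continuation bytes look like 10xxxxxx; step back over them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a\u20acb".encode("utf-8")  # 'a' is 1 byte, the euro sign 3, 'b' 1

# Offsets 1, 2 and 3 all belong to the euro sign, which starts at offset 1.
starts = [char_start(data, i) for i in range(len(data))]
print(starts)
```

This finds character boundaries near a byte offset, but it still does not tell you which character number you are at, so indexing by character remains expensive.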
That's why it's necessary to convert to UTF-8 before writing
to a file or socket.
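For what it's worth, the round trip might look like the following. This is a sketch assuming Python 3; the sample text and the file name demo.txt are placeholders chosen for the example.

```python
# Sketch only: assumes Python 3. The text and file name are placeholders.
import os
import tempfile

text = "caf\u00e9 \u20ac5"       # Unicode string, indexable by character
encoded = text.encode("utf-8")   # byte string, suitable for a file or socket

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")
    # Binary mode: we write exactly the bytes we encoded.
    with open(path, "wb") as f:
        f.write(encoded)
    with open(path, "rb") as f:
        raw = f.read()

# Decoding the bytes gives back the original Unicode string.
decoded = raw.decode("utf-8")
print(len(text), len(encoded), decoded == text)
```

The character count and the byte count differ here because two of the characters need more than one byte in UTF-8.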