Terry Carroll wrote:
> I'm pretty iffy on this stuff myself, but as I see it, you basically have
> three kinds of things here.
>
> First, an ascii string:
>
>   s = 'abc'
>
> In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'.
>
> Second, a unicode string:
>
>   u = u'abc'
>
> I can't say what this is "in hex" because that's not meaningful.  A
> Unicode character is a code point, which can be represented in a variety
> of ways, depending on the encoding used.  So, moving on....
>
> Finally, you can have a sequence of bytes, which are stored in a string as
> a buffer, that shows the particular encoding of a particular string:
>
>   e8 = s.encode("UTF-8")
>   e16 = s.encode("UTF-16")
>
> Now, e8 and e16 are each strings (of bytes), the content of which tells
> you how the string of characters that was encoded is represented in that
> particular encoding.

I would say that there are two kinds of strings, byte strings and unicode
strings. Byte strings have an implicit encoding. If the contents of the
byte string are all ascii characters, you can generally get away with
ignoring that they are in an encoding, because most of the common 8-bit
character encodings include plain ascii as a subset (all the latin-x
encodings, all the Windows cp12xx encodings, and utf-8 all have ascii as a
subset), so an ascii string can be interpreted as any of those encodings
without error. As soon as you get away from ascii, you have to be aware of
the encoding of the string.

encode() really wants a unicode string, not a byte string. If you call
encode() on a byte string, the string is first converted to unicode using
the default encoding (usually ascii), then converted with the given
encoding.

> In hex, these look like this.
>
>   e8: 616263 (61 for 'a'; 62 for 'b', 63 for 'c')
>   e16: FFFE6100 62006300
>     (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')
>
> Now, superficially, s and e8 are equal, because for plain old ascii
> characters (which is all I've used in this example), UTF-8 is equivalent
> to ascii.  And they compare the same:
>
> >>> s == e8
> True

They are equal in every sense; I don't know why you consider this
superficial. And if your original string was not ascii, the encode() would
fail with a UnicodeDecodeError.

> But that's not true of the UTF-16:
>
> >>> s == e16
> False
> >>> e8 == e16
> False
>
> So (and I'm open to correction on this), I think of the encode() method as
> returning a string of bytes that represents the particular encoding of a
> string value -- and it can't be used as the string value itself.

The idea that there is somehow some kind of string value that doesn't have
an encoding will bring you a world of hurt as soon as you venture out of
the realm of pure ascii. Every string is a particular encoding of character
values. It's not any different from "the string value itself".

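A rough sketch of the "ascii is a subset" point, for anyone who wants to poke at it. This is written for Python 3, which is an assumption on my part: the session in this thread is Python 2, where 'abc' is a byte string. In Python 3 the byte string is spelled b'abc' and a plain 'abc' is already unicode. The variable names are just for illustration.

```python
# Python 3 sketch: bytes is the byte string, str is the unicode string.
ascii_bytes = b'abc'

# Pure-ascii bytes decode to the same text under any ascii-superset encoding.
encodings = ('ascii', 'latin-1', 'cp1252', 'utf-8')
decoded = {enc: ascii_bytes.decode(enc) for enc in encodings}
all_same = len(set(decoded.values())) == 1

# Outside ascii, the same character becomes different bytes per encoding.
e_acute = '\u00e9'                        # LATIN SMALL LETTER E WITH ACUTE
latin1_bytes = e_acute.encode('latin-1')  # one byte
utf8_bytes = e_acute.encode('utf-8')      # two bytes

# Reading bytes with the wrong encoding gives wrong text or an error.
misread = utf8_bytes.decode('latin-1')    # two wrong characters, no error
try:
    latin1_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

print(all_same, latin1_bytes, utf8_bytes, repr(misread), wrong_decode_failed)
```

The silent misread is the nasty case: nothing fails, you just get the wrong characters.
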
> But you can get that string value back (assuming all the characters map
> to ascii):
>
> >>> s8 = e8.decode("UTF-8")
> >>> s16 = e16.decode("UTF-16")
> >>> s == s8 == s16
> True

You can get back to the ascii-encoded representation of the string. Though
here you are hiding something - s8 and s16 are unicode strings while s is a
byte string.

In [13]: s = 'abc'

In [14]: e8 = s.encode("UTF-8")

In [15]: e16 = s.encode("UTF-16")

In [16]: s8 = e8.decode("UTF-8")

In [17]: s16 = e16.decode("UTF-16")

In [18]: s8
Out[18]: u'abc'

In [19]: s16
Out[19]: u'abc'

In [20]: s
Out[20]: 'abc'

In [21]: type(s8) == type(s)
Out[21]: False

The way I think of it is, unicode is the "pure" representation of the
string. (This is nonsense, I know, but I find it a convenient mnemonic.)
encode() converts from the "pure" representation to an encoded
representation. The encoding can be ascii, latin-1, utf-8... decode()
converts from the coded representation back to the "pure" one.

Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
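
For reference, here is the same encode()/decode() round trip as a small script. It assumes Python 3, not the Python 2 used in the session above, so the starting value is a unicode string (str) and the encoded values are bytes. The names mirror the ones in the thread. I start from a unicode string on purpose, since that is what encode() wants.

```python
import binascii

text = 'abc'                  # unicode string (str in Python 3)

# encode(): "pure" representation -> encoded bytes
e8 = text.encode('utf-8')
e16 = text.encode('utf-16')   # starts with a BOM in the platform's byte order

hex8 = binascii.hexlify(e8).decode('ascii')
bom = e16[:2]                 # FFFE or FEFF, depending on byte order

# decode(): encoded bytes -> "pure" representation
s8 = e8.decode('utf-8')
s16 = e16.decode('utf-16')

round_trip_ok = (text == s8 == s16)
types_differ = type(e8) is not type(s8)   # bytes vs str

print(hex8, binascii.hexlify(bom), round_trip_ok, types_differ)
```

Both decoded values compare equal to the original text, while the encoded values are a different type altogether, which is the distinction the In [21] line is pointing at.
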