Martin v. Löwis:
>willie wrote:
>
>> Thank you for your patience and for educating me.
>> (Though I still have a long way to go before enlightenment)
>> I thought Python might have a small weakness in
>> lacking an efficient way to get the number of bytes
>> in a "UTF-8 encoded Python string object" (proper?),
>> but I've been disabused of that notion.
>
>Well, to get to the enlightenment, you have to understand
>that Unicode and UTF-8 are *not* synonyms.
>
>A Python Unicode string is an abstract sequence of
>characters. It does have an in-memory representation,
>but that is irrelevant and depends on what microprocessor
>you use. A byte string is a sequence of quantities with
>8 bits each (called bytes).
>
>For each of them, the notion of "length" exists: For
>a Unicode string, it's the number of characters; for
>a byte string, the number of bytes.
>
>UTF-8 is a character encoding; it is only meaningful
>to say that byte strings have an encoding (where
>"UTF-8", "cp1252", "iso-2022-jp" are really very
>similar). For a character encoding, "what is the
>number of bytes?" is a meaningful question. For
>a Unicode string, this question is not meaningful:
>you have to specify the encoding first.
>
>Now, there is no len(unicode_string, encoding) function:
>len takes a single argument. To specify both the string
>and the encoding, you have to write
>len(unicode_string.encode(encoding)). This, as a
>side effect, actually computes the encoding.
>
>While it would be possible to answer the question
>"how many bytes has Unicode string S in encoding E?"
>without actually encoding the string, doing so would
>require codecs to implement their algorithm twice:
>once to count the number of bytes, and once to
>actually perform the encoding. Since this operation
>is not that frequent, it was chosen not to put the
>burden of implementing the algorithm twice (actually,
>doing so was never even considered).
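To check my understanding of the point above, here is a minimal sketch of the two notions of "length". It is written with Python 3 naming as an assumption (there `str` is the Unicode type and `bytes` is the byte string; in the Python 2 terms used in this thread those would be `unicode` and `str`). The sample text and the helper name `encoded_length` are made up for illustration.

```python
# Assumption: Python 3, where str is the Unicode string type
# and bytes is the byte string type.

# Sample text (hypothetical): 'naïve €5', written with escapes so the
# source file itself stays pure ASCII.
text = "na\u00efve \u20ac5"


def encoded_length(unicode_string, encoding):
    """Return the number of bytes the string occupies in `encoding`.

    There is no len(unicode_string, encoding); the string is actually
    encoded here, and the resulting byte string is measured.
    """
    return len(unicode_string.encode(encoding))


# Length of the Unicode string: number of characters.
char_count = len(text)

# Length of the encoded byte strings: number of bytes, which
# depends on the encoding chosen.
utf8_bytes = encoded_length(text, "utf-8")
cp1252_bytes = encoded_length(text, "cp1252")
utf16_bytes = encoded_length(text, "utf-16-le")

print(char_count, utf8_bytes, cp1252_bytes, utf16_bytes)
```

The character count stays the same throughout, while the byte count changes with the encoding, which is why the question "how many bytes?" only makes sense once an encoding is named.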

Thanks for the thorough explanation. One last question
about terminology then I'll go away :)
What is the proper way to describe "ustr" below?

    >>> ustr = buf.decode('UTF-8')
    >>> type(ustr)
    <type 'unicode'>

Is it a "unicode object that contains a UTF-8 encoded
string object?"