byte count unicode string

willie Wed, 20 Sep 2006 15:55:09 -0700

Martin v. Löwis:

 >willie schrieb:
 >
 >> Thank you for your patience and for educating me.
 >> (Though I still have a long way to go before enlightenment)
 >> I thought Python might have a small weakness in
 >> lacking an efficient way to get the number of bytes
 >> in a "UTF-8 encoded Python string object" (proper?),
 >> but I've been disabused of that notion.
 >
 >Well, to get to the enlightenment, you have to understand
 >that Unicode and UTF-8 are *not* synonyms.
 >
 >A Python Unicode string is an abstract sequence of
 >characters. It does have an in-memory representation,
 >but that is irrelevant and depends on what microprocessor
 >you use. A byte string is a sequence of quantities with
 >8 bits each (called bytes).
 >
 >For each of them, the notion of "length" exists: For
 >a Unicode string, it's the number of characters; for
 >a byte string, the number of bytes.
 >
 >UTF-8 is a character encoding; it is only meaningful
 >to say that byte strings have an encoding (where
 >"UTF-8", "cp1252", "iso-2022-jp" are really very
 >similar). For a character encoding, "what is the
 >number of bytes?" is a meaningful question. For
 >a Unicode string, this question is not meaningful:
 >you have to specify the encoding first.
 >
 >Now, there is no len(unicode_string, encoding) function:
 >len takes a single argument. To specify both the string
 >and the encoding, you have to write
 >len(unicode_string.encode(encoding)). This, as a
 >side effect, actually computes the encoding.
 >
 >While it would be possible to answer the question
 >"how many bytes has Unicode string S in encoding E?"
 >without actually encoding the string, doing so would
 >require codecs to implement their algorithm twice:
 >once to count the number of bytes, and once to
 >actually perform the encoding. Since this operation
 >is not that frequent, it was chosen not to put the
 >burden of implementing the algorithm twice (actually,
 >doing so was never even considered).
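
A minimal sketch of the two approaches described in the quoted text, not anything taken from the codecs themselves. `encoded_length` is the `len(unicode_string.encode(encoding))` idiom wrapped in a function. `utf8_length_by_counting` is a hypothetical counting-only variant for UTF-8, shown only to illustrate what "implementing the algorithm twice" would mean. Both names and the sample string are assumptions made for illustration. The counting variant assumes iterating the string yields one item per code point, which holds on Python 3.3+ but not on narrow Python 2 builds.

```python
def encoded_length(text, encoding):
    """Bytes needed for `text` in `encoding`; encodes first, then counts."""
    return len(text.encode(encoding))


def utf8_length_by_counting(text):
    """Hypothetical counting-only variant: no byte string is built.

    Assumes one iteration item per code point (see the note above).
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # ASCII range: 1 byte
            total += 1
        elif cp < 0x800:     # up to U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # rest of the BMP: 3 bytes
            total += 3
        else:                # supplementary planes: 4 bytes
            total += 4
    return total


name = u"L\u00f6wis"                       # 5 characters, one of them non-ASCII
print(len(name))                           # 5 characters
print(encoded_length(name, "utf-8"))       # 6 bytes
print(encoded_length(name, "cp1252"))      # 5 bytes
print(encoded_length(name, "utf-16-le"))   # 10 bytes
print(utf8_length_by_counting(name))       # 6, same as the utf-8 line above
```

The same five-character string has a different byte count in each encoding, which is why the question only makes sense once an encoding is named.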



Thanks for the thorough explanation. One last question
about terminology, then I'll go away :)
What is the proper way to describe "ustr" below?

>>> ustr = buf.decode('UTF-8')
>>> type(ustr)
<type 'unicode'>


Is it a "unicode object that contains a UTF-8 encoded
string object?"
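
To make the question concrete, here is a small sketch of the same session with a hypothetical `buf` (the real `buf` is not shown here, so the bytes below are an assumption). It only shows what can be observed about the decoded object: its length is counted in characters, and it can be encoded again with any codec that covers its characters.

```python
buf = b"L\xc3\xb6wis"            # hypothetical input: 6 bytes of UTF-8
ustr = buf.decode("UTF-8")       # decoded text, 5 characters

print(len(buf))                  # 6 bytes
print(len(ustr))                 # 5 characters

# Nothing about the decoded object ties it to UTF-8: it can be
# encoded back to UTF-8 or to a different codec entirely.
as_utf8 = ustr.encode("UTF-8")
as_latin1 = ustr.encode("iso-8859-1")
print(as_utf8 == buf)            # True
print(len(as_latin1))            # 5 bytes
```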

-- 
http://mail.python.org/mailman/listinfo/python-list
