Paul Rubin:
> I still don't get it. UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at? It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.

Python Unicode strings are arrays of code units. A code unit is either 16 or 32 bits wide, and the width is fixed when Python is compiled. s[17] is the 18th code unit of the string. It is found by direct indexing, with no ancillary data structure or processing to interpret the string as a sequence of code points. On a 16-bit build, a character outside the Basic Multilingual Plane occupies two code units (a surrogate pair), so s[17] is not necessarily the 18th character. This is the same technique used by other languages such as Java.

Implementing the Python string type with a data structure that can switch between UTF-8, UTF-16 and UTF-32 while preserving the appearance of a UTF-32 sequence has been proposed, but it has not gained traction because of its complexity and cost.

Neil
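
To make the code unit versus code point distinction concrete, here is a rough sketch. It assumes Python 3.3 or later, where str is indexed by code point, so it emulates the 16-bit view by encoding to UTF-16 by hand. The helper name utf16_code_units is made up for illustration and is not part of any Python API.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units a 16-bit build would store for text.

    Hypothetical helper for illustration only.
    """
    raw = text.encode("utf-16-le")  # little-endian, no BOM, 2 bytes per unit
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# 'a', then U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), then 'b'
s = "a\U0001D11Eb"

code_points = [ord(ch) for ch in s]
units = utf16_code_units(s)

print(len(code_points))  # 3 code points
print(len(units))        # 4 code units: the clef needs a surrogate pair
print(hex(units[1]))     # 0xd834, the high surrogate, not a whole character
```

Indexing the list of code units at position 1 yields half of a surrogate pair, which is what s[1] would have returned on a 16-bit build for this string.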