Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".
I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post.  I can
only hope some readers will benefit from it.  I regret that I wasn't more
clear about the perspective I posted from, i.e. that I'm already familiar
with how those encodings work.  After reading all of it, I still have the
same skepticism on the main point as before, but I think I see what the
issue in contention is, and some differences in perspective.

First of all, you wrote:

> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16

My own understanding is that UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out not to be enough, so the supplementary planes were added.
UCS-2 thus became obsolete, and UTF-16 superseded it in 1996.  UTF-16 in
turn is rather clumsy, and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

On to the main issue:

> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
> operations are not O(1) but are O(N). That means they are slow, or
> buggy, pick one.

This I don't see.  What are the basic string operations?

* Examine the first character, or the first few characters ("few" =
  "usually bounded by a small constant"), such as to parse a token from
  an input stream.  This is O(1) with either encoding.

* Slice off the first N characters.  This is O(N) with either encoding
  if it involves copying the chars.  I guess you could share references
  into the same string, but if the slice reference persists while the
  big reference is released, you end up not freeing the memory until
  later than you really should.

* Concatenate two strings.  O(N) either way.

* Find the length of the string.  O(1) either way, since you'd store it
  in the string header when you build the string in the first place.
Building the string has to have been an O(N) operation in either
representation.

And finally:

* Access the nth char in the string for some large random n, or get a
  small slice from some random place in a big string.  This is where
  fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of is that this last operation happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ASCII, but there would be occasional non-ASCII
chars, including supplementary plane characters, either because of
special symbols that were really in the text, or because of the typical
OCR confusion emitting those symbols due to printing imprecision.
That's a natural fit for UTF-8, but the PEP 393 approach would bloat up
the memory requirements by a factor of 4.

  py> s = chr(0xFFFF + 1)
  py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

A rope of UTF-8 segments seems like the most obvious approach, and I
wonder if it was considered.  By that I mean: pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array,
where n is the number of codepoints in the string.  Then you can reach
any offset analogously to reading a random byte on a disk, by seeking to
the appropriate block, and then reading the block and getting the char
you want within it.
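A minimal sketch of that block-index idea (in Python 3; the class and
method names here are made up for illustration, and it's nothing like a
full rope -- no slicing or concatenation).  The byte array carries the
text, and the byte offset of every k-th codepoint is recorded, so random
access seeks to the right block and then scans at most k-1 variable-width
characters within it:

```python
K = 128  # the implementation constant k from the description above

def _width(lead: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

class IndexedUTF8:
    """A UTF-8 byte array plus a vector of byte offsets, one per block
    of k codepoints.  A sketch under the assumptions above, not a
    production string type."""

    def __init__(self, s: str, k: int = K):
        self.k = k
        self.data = s.encode('utf-8')
        self.length = len(s)
        self.offsets = []              # byte offset of codepoint j*k
        off = 0
        for j, ch in enumerate(s):
            if j % k == 0:
                self.offsets.append(off)
            off += len(ch.encode('utf-8'))

    def __getitem__(self, n: int) -> str:
        """The nth codepoint: seek to the block, then scan at most k-1
        chars within it -- O(k), i.e. O(1) for fixed k, versus O(n)
        for a scan from the very start of the byte array."""
        if not 0 <= n < self.length:
            raise IndexError(n)
        i = self.offsets[n // self.k]
        for _ in range(n % self.k):
            i += _width(self.data[i])
        return self.data[i:i + _width(self.data[i])].decode('utf-8')
```

For mostly-ASCII text like the OCR corpus above, the bytes stay at one
per character and the index adds only one pointer per k codepoints,
instead of quadrupling the whole buffer.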
Random access is then O(1), though the constant is higher than it would
be with a fixed-width encoding.

-- 
http://mail.python.org/mailman/listinfo/python-list