On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.

Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and so does Javascript:

http://mathiasbynens.be/notes/javascript-encoding

The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support them, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.

>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
>
> This I don't see. What are the basic string operations?

The ones I'm specifically referring to are indexing and copying
substrings. There may be others.

> * Examine the first character, or first few characters ("few" = "usually
> bounded by a small constant") such as to parse a token from an input
> stream. This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).

> * Slice off the first N characters. This is O(N) with either encoding
> if it involves copying the chars. I guess you could share references
> into the same string, but if the slice reference persists while the
> big reference is released, you end up not freeing the memory until
> later than you really should.

As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis looks
at algorithmic complexity; it's not a substitute for actual benchmarks.
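To make the "surrogates treated as two characters" point concrete, here's a
small sketch in Python 3.3+ (where PEP 393 is in effect). The surrogate-pair
arithmetic is the standard UTF-16 encoding rule, and U+1D11E is just a
convenient example of a supplementary plane character:

```python
# Sketch (Python 3.3+, post-PEP 393): a supplementary plane character
# is one code point to Python, but two 16-bit code units (a surrogate
# pair) in UTF-16 -- which is why a lax UCS-2-ish implementation, or
# Javascript, reports its length as 2.
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, U+1D11E (supplementary plane)

# A PEP 393 build sees a single character:
assert len(s) == 1

# Encoded as UTF-16, it becomes two 16-bit code units (4 bytes):
units = len(s.encode("utf-16-be")) // 2
assert units == 2

# The surrogate pair can be computed by hand from the code point:
cp = ord(s) - 0x10000
high = 0xD800 + (cp >> 10)   # high (lead) surrogate
low = 0xDC00 + (cp & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))   # -> 0xd834 0xdd1e
```

On a narrow build (or in Javascript), indexing into a string containing
such characters lands you on one half of the pair, which is exactly the
lax behaviour described above.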
> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

> So the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision. That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will
be bloated, which is likely to be rare. Especially since I don't expect
your OCR application would recognise many non-BMP characters -- what
does U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.

[...]

> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

There has to be a first time for everything.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.
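You can see the per-string (per-page) nature of the bloat directly in
CPython 3.3+. The exact byte counts below are implementation details of
CPython, so this sketch only compares relative sizes:

```python
# Sketch (CPython 3.3+): PEP 393 stores each string in the narrowest
# of 1-, 2- or 4-bytes-per-char layouts that fits its widest character.
# Exact sizes are CPython implementation details; we only compare them.
import sys

page_ascii = "a" * 1000                  # pure ASCII page: 1 byte/char
page_bloated = "a" * 999 + "\U0001D11E"  # one non-BMP char: 4 bytes/char

ascii_size = sys.getsizeof(page_ascii)
bloated_size = sys.getsizeof(page_bloated)

# Only the page containing the supplementary character pays the
# 4-bytes-per-char cost; for large pages that's roughly a 4x difference.
assert bloated_size > 3 * ascii_size
print(ascii_size, bloated_size)
```

So in the OCR scenario, the factor-of-4 penalty applies only to the pages
that actually contain a supplementary plane character, not to the whole
corpus.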
Ropes have been considered and rejected because, while they are
asymptotically fast, in common cases the added complexity actually makes
them slower -- especially for immutable strings, where you aren't
inserting into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as its native
string data structure. But those are ropes of *bytes*, not UTF-8.

http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list