Terry Reedy <tjre...@udel.edu> writes:
>> Meanwhile, an example of the 393 approach failing:
> I am completely baffled by this, as this example is one where the 393
> approach potentially wins.

What?  The 393 approach is supposed to avoid memory bloat and that
does the opposite.

>> I was involved in a project that dealt with terabytes of OCR data of
>> mostly English text.  So the chars were mostly ascii,
> 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.

> I doubt that there are really any non-bmp chars.

You may be right about this.  I thought about it some more after
posting and I'm not certain that there were supplementary characters.

> As Steven said, reject such false identifications.

Reject them how?

>> That's a natural for UTF-8
> 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk, though that doesn't matter since
they are also compressed.

>> but the PEP-393 approach would bloat up the memory
>> requirements by a factor of 4.
> 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
> better than always?

The bloat is in comparison with utf-8, in that example.

> That looks like a 3.2- narrow build. Such which treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but compromise design that goes back about a
> decade to when unicode was added to Python.

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches were near the top of the list.  So I'm surprised they seem to
have repeated a mistake there.

> I would call it O(k), where k is a selectable constant. Slowing access
> by a factor of 100 is hardly acceptable to me.

If k is constant then O(k) is the same as O(1).  That is how O notation
works.  I wouldn't believe the 100x figure without seeing it measured in
real-world applications.
-- 
http://mail.python.org/mailman/listinfo/python-list
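
The size comparison argued about above (PEP 393 storage versus UTF-8 for
text that is 99% ascii) can be checked with a rough sketch like the one
below.  It is not CPython's actual code: the helper names and the
1000-character "pages" are made up for illustration, and it counts only the
character payload, ignoring per-object headers.  The only fact it relies on
is that PEP 393 stores a whole string at 1, 2, or 4 bytes per character
depending on the highest code point in it.

```python
def pep393_bytes_per_char(text):
    """Width PEP 393 picks for the whole string: 1, 2, or 4 bytes per char."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # every character fits in Latin-1
        return 1
    if highest < 0x10000:    # every character fits in the BMP
        return 2
    return 4                 # at least one supplementary character


def compare(text):
    """Return (PEP 393 payload bytes, UTF-8 bytes), ignoring object headers."""
    return pep393_bytes_per_char(text) * len(text), len(text.encode("utf-8"))


# Hypothetical 1000-character "pages": 99% ascii plus 1% of one other character.
ascii_part = "a" * 990
pages = {
    "latin-1 (U+00E9)": ascii_part + "\u00e9" * 10,
    "bmp (U+2019)": ascii_part + "\u2019" * 10,
    "supplementary (U+1D11E)": ascii_part + "\U0001d11e" * 10,
}

for label, page in pages.items():
    flexible, utf8 = compare(page)
    print(f"{label:26} PEP 393: {flexible:5d}  UTF-8: {utf8:5d}")
```

Under these assumptions the gap depends entirely on which non-ascii
characters show up: a page with only Latin-1 extras costs about the same
either way, BMP extras roughly double it, and a single supplementary
character pushes it to about 4x.  On a real 3.3+ interpreter,
sys.getsizeof() on the str and on its encoded bytes gives the figures
including object overhead.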