On 8 January 2014 00:34, <wxjmfa...@gmail.com> wrote: > > Point 2: This Flexible String Representation does no > "effectuate" any memory optimization. It only succeeds > to do the opposite of what a corrrect usage of utf* > do. >
UTF-8 is a variable-width encoding that uses less memory to encode code points with lower numerical values, on a per-character basis e.g. if a code point <= U+007F it will use a single byte to encode; if <= U+07FF two bytes will be used; ... up to a maximum of 6 bytes for code points >= U+4000000. FSR is a variable-width memory structure that uses the width of the code point with the highest numerical value in the string e.g. if all code points in the string are <= U+00FF a single byte will be used per character; if all code points are <= U+FFFF two bytes will be used per character; and in all other cases 4 bytes will be used per character. In terms of memory usage the difference is that UTF-8 varies its width per-character, whereas the FSR varies its width per-string. For any particular string, UTF-8 may well result in using less memory than the FSR, but in other (quite common) cases the FSR will use less memory than UTF-8 e.g. if the string contains only contains code points <= U+00FF, but some are between U+0080 and U+00FF (inclusive). In most cases the FSR uses the same or less memory than earlier versions of Python 3 and correctly handles all code points (just like UTF-8). In the cases where the FSR uses more memory than previously, the previous behaviour was incorrect. No matter which representation is used, there will be a certain amount of overhead (which is the majority of what most of your examples have shown). Here are examples which demonstrate cases where UTF-8 uses less memory, cases where the FSR uses less memory, and cases where they use the same amount of memory (accounting for the minimum amount of overhead required for each). Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> >>> fsr = u"" >>> utf8 = fsr.encode("utf-8") >>> min_fsr_overhead = sys.getsizeof(fsr) >>> min_utf8_overhead = sys.getsizeof(utf8) >>> min_fsr_overhead 49 >>> min_utf8_overhead 33 >>> >>> fsr = u"\u0001" * 1000 >>> utf8 = fsr.encode("utf-8") >>> sys.getsizeof(fsr) - min_fsr_overhead 1000 >>> sys.getsizeof(utf8) - min_utf8_overhead 1000 >>> >>> fsr = u"\u0081" * 1000 >>> utf8 = fsr.encode("utf-8") >>> sys.getsizeof(fsr) - min_fsr_overhead 1024 >>> sys.getsizeof(utf8) - min_utf8_overhead 2000 >>> >>> fsr = u"\u0001\u0081" * 1000 >>> utf8 = fsr.encode("utf-8") >>> sys.getsizeof(fsr) - min_fsr_overhead 2024 >>> sys.getsizeof(utf8) - min_utf8_overhead 3000 >>> >>> fsr = u"\u0101" * 1000 >>> utf8 = fsr.encode("utf-8") >>> sys.getsizeof(fsr) - min_fsr_overhead 2025 >>> sys.getsizeof(utf8) - min_utf8_overhead 2000 >>> >>> fsr = u"\u0101\u0081" * 1000 >>> utf8 = fsr.encode("utf-8") >>> sys.getsizeof(fsr) - min_fsr_overhead 4025 >>> sys.getsizeof(utf8) - min_utf8_overhead 4000 Indexing a character in UTF-8 is O(N) - you have to traverse the the string up to the character being indexed. Indexing a character in the FSR is O(1). In all cases the FSR has better performance characteristics for indexing and slicing than UTF-8. There are tradeoffs with both UTF-8 and the FSR. The Python developers decided the priorities for Unicode handling in Python were: 1. Correctness a. all code points must be handled correctly; b. it must not be possible to obtain part of a code point (e.g. the first byte only of a multi-byte code point); 2. No change in the Big O characteristics of string operations e.g. indexing must remain O(1); 3. Reduced memory use in most cases. It is impossible for UTF-8 to meet both criteria 1b and 2 without additional auxiliary data (which uses more memory and increases complexity of the implementation). The FSR meets all 3 criteria. Tim Delaney
-- https://mail.python.org/mailman/listinfo/python-list