On Wed, Jun 4, 2014 at 5:02 PM, <mar...@v.loewis.de> wrote:
> There are more things to consider for the internal implementation,
> in particular how the string length is implemented. Several alternatives
> exist:
> 1. store the UTF-8 length (i.e. memory size)
> 2. store the number of code points (i.e. Python len())
> 3. store both
> 4. store neither, but use null termination instead
>
> Variant 3 is most run-time efficient, but could easily use 8 bytes
> just for the length, which could outweigh the storage of the actual
> data. Variants 1 and 2 lose on some operations (1 loses on computing
> len(), 2 loses on string concatenation). 4 would add the restriction
> of not allowing U+0000 in a string (which would be reasonable IMO),
> and make all length computations inefficient. However, it wouldn't
> be worse than standard C.
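
To make the difference between alternatives 1 and 2 concrete, here is a
rough Python sketch. The helper names utf8_len and codepoint_len are made
up for illustration and are not taken from any implementation; the point
is only that the two lengths diverge as soon as a string contains
non-ASCII characters, and that recovering len() from raw UTF-8 means
scanning every byte.

```python
# Illustrative sketch only: utf8_len and codepoint_len are hypothetical
# helper names, not part of any implementation discussed in this thread.

def utf8_len(s: str) -> int:
    """Alternative 1: the memory size of the UTF-8 encoding, in bytes."""
    return len(s.encode("utf-8"))


def codepoint_len(data: bytes) -> int:
    """Alternative 2, computed from raw UTF-8 bytes.

    Counts every byte that is not a continuation byte; continuation
    bytes have the bit pattern 0b10xxxxxx. This is an O(n) scan, which
    is the cost of len() when only the byte length is stored.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# "na<i-diaeresis>ve <euro sign>": 5 ASCII letters, a space, one 2-byte
# character and one 3-byte character.
text = "na\u00efve \u20ac"
encoded = text.encode("utf-8")

print(utf8_len(text))          # byte length of the encoded string
print(codepoint_len(encoded))  # number of code points, same as len(text)
```

For pure-ASCII strings the two numbers are identical, so a single stored
length can serve as both.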

The current implementation stores a 16-bit length, which is both the
memory size and the len(). As far as I can see, the memory size is never
needed, so I'd just go for option 2; string concatenation is already
known to be one of those operations that can be slow if you do it badly,
and an optimized str.join() would cover the recommended use-case.

ChrisA