Re: [Python-Dev] Internal representation of strings and Micropython

Chris Angelico Wed, 04 Jun 2014 00:08:27 -0700

On Wed, Jun 4, 2014 at 5:02 PM,  <[email protected]> wrote:
> There are more things to consider for the internal implementation,
> in particular how the string length is implemented. Several alternatives
> exist:
> 1. store the UTF-8 length (i.e. memory size)
> 2. store the number of code points (i.e. Python len())
> 3. store both
> 4. store neither, but use null termination instead
>
> Variant 3 is most run-time efficient, but could easily use 8 bytes
> just for the length, which could outweigh the storage of the actual
> data. Variants 1 and 2 lose on some operations (1 loses on computing
> len(), 2 loses on string concatenation). 3 would add the restriction
> of not allowing U+0000 in a string (which would be reasonable IMO),
> and make all length computations inefficient. However, it wouldn't
> be worse than standard C.


The current implementation stores a 16-bit length, which is both the
memory size and the len(). As far as I can see, the memory size is
never needed, so I'd just go for option 2; string concatenation is
already known to be one of those operations that can be slow if you do
it badly, and an optimized str.join() would cover the recommended
use-case.

ChrisA
_______________________________________________
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

Reply via email to