Re: string storage [was: Re: imaplib: is this really so unwieldy?]

Terry Reedy Wed, 26 May 2021 13:53:30 -0700

On 5/26/2021 12:07 PM, Chris Angelico wrote:

On Thu, May 27, 2021 at 1:59 AM Jon Ribbens via Python-list
<python-list@python.org> wrote:


On 2021-05-26, Alan Gauld <alan.ga...@yahoo.co.uk> wrote:

On 25/05/2021 23:23, Terry Reedy wrote:

In CPython's Flexible String Representation all characters in a string
are stored with the same number of bytes, depending on the largest
codepoint.
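
One way to see this from the interpreter is to measure how much a string grows per added character. The following is only a rough sketch, assuming CPython 3.3 or later; bytes_per_char is a hypothetical helper name, and sys.getsizeof reports implementation-specific sizes, so other Python implementations may behave differently.

```python
import sys

def bytes_per_char(sample: str) -> int:
    """Estimate how many bytes CPython stores per character of `sample`.

    Hypothetical helper: it compares two repetitions of the same text so
    the fixed object header cancels out and only the per-character cost
    remains.
    """
    short = sample * 100
    longer = sample * 200
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // (len(longer) - len(short))

ascii_only = "a" * 100
one_astral = "a" * 99 + "\U0001F600"  # 99 ASCII characters plus one non-BMP character

print(bytes_per_char(ascii_only))   # largest code point < 256: 1 byte each
print(bytes_per_char("\u20ac"))     # largest code point in the BMP: 2 bytes each
print(bytes_per_char(one_astral))   # one non-BMP character: 4 bytes for every character
```

The last line is the case asked about in this thread: a single non-BMP character widens the storage of the whole string.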


I'm learning lots of new things in this thread!

Does that mean that if I give Python a UTF8 string that is mostly single
byte characters but contains one 4-byte character that Python will store
the string as all 4-byte characters?


Note that while unix uses utf-8, Windows uses utf-16.

If so, doesn't that introduce a pretty big storage overhead for
large strings?


Memory is cheap ;-)


This is true, but sometimes memory translates into time - either
direction. When the Flexible String Representation came in, it was
actually an alternative to using four bytes per character on ALL
strings (not just those that contain non-BMP characters),

Except on Windows, where CPython used 2 bytes/char + surrogates for
non-BMP chars. This meant that indexing did not quite work on Windows
and that applications that allowed astral chars and wanted to work on
all systems had to have separate code for Windows and unix-based systems.
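
The indexing problem can be illustrated on a current Python by encoding to UTF-16 explicitly. This is a sketch, not a reproduction of the old Windows builds: utf16_units is a hypothetical helper, and the list it returns only approximates what indexing saw when strings were stored as 16-bit code units.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

s = "a\U0001F600b"          # 'a', one astral character, 'b'
units = utf16_units(s)

print(len(s))               # number of code points seen by Python today
print(len(units))           # number of 16-bit code units
print(hex(units[1]))        # first half of the surrogate pair
print(hex(units[2]))        # second half of the surrogate pair
```

With 16-bit storage, the string has one more unit than it has characters, and index 1 lands on half of a surrogate pair rather than on a whole character.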

and it
actually improved performance quite notably, despite some additional
complications.


And it made CPython text manipulation code work on all CPython systems.

Performance optimization is a funny science :)



--
Terry Jan Reedy

