On 5/26/2021 12:07 PM, Chris Angelico wrote:
On Thu, May 27, 2021 at 1:59 AM Jon Ribbens via Python-list
<python-list@python.org> wrote:

On 2021-05-26, Alan Gauld <alan.ga...@yahoo.co.uk> wrote:
On 25/05/2021 23:23, Terry Reedy wrote:
In CPython's Flexible String Representation all characters in a string
are stored with the same number of bytes, depending on the largest
codepoint.
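
A rough way to observe this, sketched under the assumption that the interpreter is CPython 3.3 or later (the helper name `bytes_per_char` is made up for illustration, and `sys.getsizeof` results are an implementation detail, not a language guarantee):

```python
import sys


def bytes_per_char(sample: str, n: int = 1000) -> int:
    """Estimate per-character storage for strings built from `sample`.

    Comparing two lengths of the same text cancels the fixed object
    header, leaving only the cost of the extra characters.
    """
    short = sample * n
    long = sample * (2 * n)
    extra_chars = len(long) - len(short)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // extra_chars


ascii_width = bytes_per_char("a")             # largest code point < 128
latin1_width = bytes_per_char("\xe9")         # largest code point < 256
bmp_width = bytes_per_char("\u20ac")          # largest code point < 65536
astral_width = bytes_per_char("\U0001F600")   # largest code point >= 65536

# One astral character widens every character in the string.
plain = "a" * 1000
mixed = "a" * 999 + "\U0001F600"
plain_size = sys.getsizeof(plain)
mixed_size = sys.getsizeof(mixed)

print(ascii_width, latin1_width, bmp_width, astral_width)
print(plain_size, mixed_size)
```

On such an interpreter the four widths should come out as 1, 1, 2 and 4 bytes, and `mixed` should be roughly four times the size of `plain` even though the two differ in a single character.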

I'm learning lots of new things in this thread!

Does that mean that if I give Python a UTF-8 string that is mostly
single-byte characters but contains one 4-byte character, Python will
store the string as all 4-byte characters?

Note that while Unix uses UTF-8, Windows uses UTF-16.

If so, doesn't that introduce a pretty big storage overhead for
large strings?

Memory is cheap ;-)


This is true, but sometimes memory translates into time - either
direction. When the Flexible String Representation came in, it was
actually an alternative to using four bytes per character on ALL
strings (not just those that contain non-BMP characters),

Except on Windows, where CPython used 2 bytes/char plus surrogates for
non-BMP chars. This meant that indexing did not quite work on Windows,
and that applications that allowed astral chars and wanted to work on
all systems had to have separate code for Windows and Unix-based
systems.
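
The old 2-bytes/char behavior cannot be reproduced directly on a current CPython, but it can be approximated by looking at the UTF-16 code units of a string. The following is only a sketch of that idea, not the code the old builds used; the variable names are illustrative:

```python
import struct

text = "a\U0001F600b"  # 'a', one astral (non-BMP) character, 'b'

# Current CPython counts and indexes by code point.
code_points = len(text)

# A build storing 2 bytes/char would effectively have seen UTF-16 code
# units instead, so the astral character occupied two slots.
raw = text.encode("utf-16-le")
units = struct.unpack(f"<{len(raw) // 2}H", raw)

print(code_points)               # number of code points
print(len(units))                # number of UTF-16 code units
print([hex(u) for u in units])   # includes a high and a low surrogate
```

Here the string has 3 code points but 4 UTF-16 code units, because U+1F600 becomes the surrogate pair 0xD83D, 0xDE00. With per-unit indexing, index 1 would land on a lone surrogate rather than on the character, which is the sense in which indexing "did not quite work".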

and it
actually improved performance quite notably, despite some additional
complications.

And it made CPython text manipulation code work on all CPython systems.

Performance optimization is a funny science :)


--
Terry Jan Reedy
