Re: string storage [was: Re: imaplib: is this really so unwieldy?]

jak Fri, 28 May 2021 14:30:49 -0700

On 27/05/2021 05:54, Cameron Simpson wrote:

On 26May2021 12:11, Jon Ribbens <[email protected]> wrote:

On 2021-05-26, Alan Gauld <[email protected]> wrote:

I confess I had just assumed the unicode strings were stored
in native unicode UTF8 format.


If you do that then indexing and slicing strings becomes very slow.


True, but that isn't necessarily a show stopper. My impression, on
reflection, is that most slicing is close to the beginning or end of a
string, and that _most_ strings are small. (Alan has exceptions at least
to the latter.) In those circumstances, the cost of slicing a variable
width encoding is greatly mitigated.

Indexing is probably more general (in my subjective hand waving
guesstimation). But... how common is indexing into large strings?
Versus, say, iteration over a large string?
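To make the indexing cost concrete, here is a minimal sketch in Python of what fetching the n-th character from UTF8 bytes involves. The function name utf8_index is made up for illustration, and it assumes the input is valid UTF8; it is not how CPython or Go actually implement anything. With a variable-width encoding there is no way to compute the byte offset directly, so the code has to count character starts from the beginning:

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 bytes.

    Hypothetical helper for illustration: it scans from the start,
    so the cost grows with n instead of being constant.
    """
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1      # index of the most recent character start seen
    start = None    # byte offset where the n-th character begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a character starts here
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                # The next character has started, so the n-th one is complete.
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")   # n-th character is the last one


text = "na\u00efve caf\u00e9 \u2615"
encoded = text.encode("utf-8")
print(utf8_index(encoded, 2) == text[2])
```

For an index near the start of the string the scan stops almost immediately, which matches the point above that slicing near the beginning is cheap; iteration over the whole string is one linear pass either way.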

I was surprised when getting introduced to Golang a few years ago that
it stores all Strings as UTF8 byte sequences. And when writing Go code,
I found very few circumstances where that would actually bring
performance issues, which I attribute in part to my suggestions above
about when, in practical terms, we slice and index strings.

If the internal storage is UTF8, then in an ecosystem where all, or
most, text files are themselves UTF8 then reading a text file has zero
decoding cost - you can just read the bytes and store them! And to write
a String out to a UTF8 file, you just copy the bytes - zero encoding!

--------------------------------

Also, UTF8 is a funny thing - it is deliberately designed so that you
can just jump into the middle of an arbitrary stream of UTF8 bytes and
find the character boundaries. That doesn't solve slicing/indexing in
general, but it does avoid any risk of producing mojibake just by
starting your decode at a random place.

Perhaps you are referring to what the Python language does if you jump to an arbitrary position in a UTF8 string. Otherwise, before you start decoding, you should align to the beginning of a UTF8 character by discarding the bytes that meet the following test:


(byte & 0xc0) == 0x80 /* C language; true for a UTF8 continuation byte */
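The same test could be sketched in Python roughly as follows. The name align_to_char_start is hypothetical, and the sketch assumes the bytes are valid UTF8; it skips continuation bytes (bit pattern 10xxxxxx) until it reaches a byte that starts a character:

```python
def align_to_char_start(data: bytes, pos: int) -> int:
    """Return the first offset >= pos that is not a UTF-8 continuation byte.

    Hypothetical helper: continuation bytes look like 10xxxxxx, so
    (byte & 0xC0) == 0x80 identifies them. Returns len(data) if only
    continuation bytes remain.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


encoded = "h\u00e9llo \u2615".encode("utf-8")
# Offset 2 is the second byte of the two-byte character at offset 1.
start = align_to_char_start(encoded, 2)
print(encoded[start:].decode("utf-8"))
```

Since a UTF8 character has at most three continuation bytes, the loop runs at most three times on valid input, so aligning costs very little even when you land at a random offset in a large buffer.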

--------------------------------

Cheers,
Cameron Simpson <[email protected]>


--
https://mail.python.org/mailman/listinfo/python-list
