[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

Andrew Barnert via Python-ideas Sun, 03 Nov 2019 07:03:26 -0800

On Nov 2, 2019, at 20:33, Random832 <random...@fastmail.com> wrote:
> 
>> On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
>>> On Oct 26, 2019, at 19:59, Random832 <random...@fastmail.com> wrote:
>>> 
>>> A string representation considering of (say) a UTF-8 string, plus an 
>>> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along 
>>> with perhaps a flag to say that the chunk is all-ASCII or not] would 
>>> provide O(1) random access, though, of course, despite both being O(1), 
>>> "single index access" vs "single index access then either another index 
>>> access or up to 256 iterate-forward operations" aren't *really* the same 
>>> speed.
>> 
>> Yes, but that means constructing a string takes linear time, because 
>> you have to construct that index. You can’t just take a 
>> read/recv/mmap/result of a C library/whatever and use it as a string 
>> without doing linear work on it first. 
> 
> constructing a string already takes linear time because you have to copy it 
> into memory managed by the python garbage collector.


Not necessarily. There are certainly _some_ strings that come into the 
interpreter (or extension module) as externally-allocated things that have to 
be copied. But not all, or even most, strings. Things like reading a file or 
recving from a socket, you allocate a buffer which is managed by your GC, and 
the string gets placed there, so there’s no need to copy it. When you mmap a 
file, you know the lifetime of the string is the lifetime of the mmap, so you 
don’t track it separately, much less copy it. And so on.

Also, even when you do need to copy, a memcpy is a whole lot faster than a 
loop, even though they are both linear. Especially when that loop has 
additional operations (maybe even including a conditional that branches 80/20 
or worse). But even without that, copying byte by byte, rather than by whatever 
chunks the CPU likes, can already be 16x as slow. Go often ends up copying 
strings unnecessarily, but the memcpy is still so much faster than the decode 
that Java/C#/Python/Ruby does that Go fanatics like to brag about their fast 
text handling (until you show them some Rust to Swift code that’s even faster 
as well as more readable…).

> And you can track whether you'll need the index in one pass while copying, 
> rather than, as currently, having to do one pass to pick a representation and 
> another to actually perform the copying and conversion, so my suggestion may 
> have a cache locality advantage over the other way.

Sure, the existing implementation of building strings is slow, and that’s what 
keeping strings in UTF-8 is intended to solve, and if your suggestion makes it 
take 1/4th as long (which seems possible, but obviously it’s just a number I 
pulled out of thin air), that’s nice—but nowhere near as nice as completely 
eliminating that cost.

And most strings, you never need to randomly access (or only need to randomly 
access because other parts of the API, like str.find and re.search, make you), 
so why should you pay any cost, even if it’s only 1/4th the cost you pay in 
Python 3.8? (Also, for some random-access uses, it really is going to be faster 
to just decode to UTF-32 and subscript that; why build an index plus decoding 
when you can just decode?) If you’re already making a radical breaking change, 
why not get the radical benefits?

Also, consider this: if str is unindexed and non-subscriptable, it’s trivial to 
build a class IndexedStr(str) whose __new__ builds the index (or copies it even 
passed an IndexedStr) and that adds __getitem__, while still acting as a str 
even at the C API level. Whenever you need random access, you construct an 
IndexedStr, the rest of the time you don’t bother. And you can even create 
special-purpose variants for special strings (I know this is always ASCII, or I 
know it’s always under 16 chars…) or specific use cases (I know I’m going to 
iterate backward, so repeatedly counting forward from index[idx%16] would be 
hugely wasteful). But if str builds an index, there’s no way to write a class 
FastStr(str) that skips that, or any of the variants that does it differently.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GTCGPA3NIXVK63QIFGR5H74YRIDGK3SR/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

Reply via email to