On Nov 2, 2019, at 20:33, Random832 <random...@fastmail.com> wrote: > >> On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote: >>> On Oct 26, 2019, at 19:59, Random832 <random...@fastmail.com> wrote: >>> >>> A string representation considering of (say) a UTF-8 string, plus an >>> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along >>> with perhaps a flag to say that the chunk is all-ASCII or not] would >>> provide O(1) random access, though, of course, despite both being O(1), >>> "single index access" vs "single index access then either another index >>> access or up to 256 iterate-forward operations" aren't *really* the same >>> speed. >> >> Yes, but that means constructing a string takes linear time, because >> you have to construct that index. You can’t just take a >> read/recv/mmap/result of a C library/whatever and use it as a string >> without doing linear work on it first. > > constructing a string already takes linear time because you have to copy it > into memory managed by the python garbage collector.
Not necessarily. There are certainly _some_ strings that come into the interpreter (or extension module) as externally-allocated things that have to be copied. But not all, or even most, strings. Things like reading a file or recving from a socket, you allocate a buffer which is managed by your GC, and the string gets placed there, so there’s no need to copy it. When you mmap a file, you know the lifetime of the string is the lifetime of the mmap, so you don’t track it separately, much less copy it. And so on. Also, even when you do need to copy, a memcpy is a whole lot faster than a loop, even though they are both linear. Especially when that loop has additional operations (maybe even including a conditional that branches 80/20 or worse). But even without that, copying byte by byte, rather than by whatever chunks the CPU likes, can already be 16x as slow. Go often ends up copying strings unnecessarily, but the memcpy is still so much faster than the decode that Java/C#/Python/Ruby does that Go fanatics like to brag about their fast text handling (until you show them some Rust to Swift code that’s even faster as well as more readable…). > And you can track whether you'll need the index in one pass while copying, > rather than, as currently, having to do one pass to pick a representation and > another to actually perform the copying and conversion, so my suggestion may > have a cache locality advantage over the other way. Sure, the existing implementation of building strings is slow, and that’s what keeping strings in UTF-8 is intended to solve, and if your suggestion makes it take 1/4th as long (which seems possible, but obviously it’s just a number I pulled out of thin air), that’s nice—but nowhere near as nice as completely eliminating that cost. And most strings, you never need to randomly access (or only need to randomly access because other parts of the API, like str.find and re.search, make you), so why should you pay any cost, even if it’s only 1/4th the cost you pay in Python 3.8? (Also, for some random-access uses, it really is going to be faster to just decode to UTF-32 and subscript that; why build an index plus decoding when you can just decode?) If you’re already making a radical breaking change, why not get the radical benefits? Also, consider this: if str is unindexed and non-subscriptable, it’s trivial to build a class IndexedStr(str) whose __new__ builds the index (or copies it even passed an IndexedStr) and that adds __getitem__, while still acting as a str even at the C API level. Whenever you need random access, you construct an IndexedStr, the rest of the time you don’t bother. And you can even create special-purpose variants for special strings (I know this is always ASCII, or I know it’s always under 16 chars…) or specific use cases (I know I’m going to iterate backward, so repeatedly counting forward from index[idx%16] would be hugely wasteful). But if str builds an index, there’s no way to write a class FastStr(str) that skips that, or any of the variants that does it differently. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/GTCGPA3NIXVK63QIFGR5H74YRIDGK3SR/ Code of Conduct: http://python.org/psf/codeofconduct/