[Python-ideas] Re: Set operations with Lists

Andrew Barnert via Python-ideas Fri, 20 Sep 2019 09:23:57 -0700

On Sep 20, 2019, at 03:18, Richard Musil <risa20...@gmail.com> wrote:
> 
> This is an interesting point, which is difficult to deduce without an 
> implementation insight. I would for example assume that there is a "list 
> construct" which holds just the references to the actual objects (strings in 
> our example here) and the same for the dict (hashmap), so the actual objects 
> could be anywhere, while the "list" or "dict" can be in one (memory) place. 
> Did you mean that, or that also the object themselves are in the "same place"?


With a list of strings, it’s actually even more than that.

A CPython list is a (pointer to a) struct that holds, along with a few other 
things, a pointer to an array of PyObject* pointers. This array is obviously 
contiguous by definition. By comparison, a set is a (pointer to a) struct that 
holds a pointer to a hash table, which is an array of buckets in hash order, 
which isn’t the order you’re going to be accessing them in for the b set.

If the values are strings, the pointers point to PyUnicode structs, each of 
which holds… well, it’s complicated, but in this case (where they’re all pure 
ASCII and constructed out of normal Python operations) it’s going to turn out 
to be a pointer to a char*, which is the array of actual characters.

Also notice that string objects cache their hash in the struct, so for the set 
case, it only has to go to the char array at the very end to double-check 
equality; the operations before that only go one level down instead of two. 
Comparisons have to go all the way to the chars every time.

The PyUnicode structs get allocated by a special object allocator which is 
somewhat complicated. When you construct a bunch of objects in order in a fresh 
process they’re likely to end up mostly in contiguous order, maybe with a few 
jumps every so often. This would not be as true in a long-lived process that’s 
created and freed zillions of previous objects.

And then the char arrays are allocated by a different allocator, but are also 
likely to be largely in contiguous order in this test, but not so much in a 
long-lived process.

In theory, because str is a known immutable type, Python is allowed to merge 
dups into the same object. But in practice, we’re creating midsize strings out 
of __add__ and join, so I don’t think CPython will ever do that here. (I don’t 
know whether an implementation designed to sacrifice speed for space more, like 
MicroPython, might?)

Of course if you generate random strings and sort them, neither the objects nor 
the char arrays will be contiguous anymore, but the list’s pointer array will 
be.

My version of the test generates strings that happen to be in order and 
shuffles them so we can sort them. In that case, the sorting part won’t have 
contiguous input in the objects and char arrays, but the input to the 
step-compare part will be close to contiguous again (although not exactly so, 
if there are any dups that could get shuffled out of order), so it’s still 
giving the list code an unfair advantage that it wouldn’t have in a real life 
unsorted case.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FUDJU4AKJIFM2QNZP7HCMDQB2I6GWXPG/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Set operations with Lists

Reply via email to