On Sep 21, 2019, at 04:35, Richard Musil <risa20...@gmail.com> wrote:
> 
> I wrote all this to show that without an insight it might be sometimes 
> difficult or impossible to do it right (I was caught myself in several 
> pitfalls on my original benchmarks I posted here) and also because it was 
> actually a fun to learn quite a few things from this thread. I am however not 
> convinced that step-compare lists can compete with set operations (even when 
> done on lists) at the current state of things.

One case I think you didn’t test is when the strings are generated in 
already-sorted order. In that case, as opposed to the case where you generate 
in random order and then sort, I think the PyUnicode objects and the actual 
character arrays will both be mostly contiguous (at least in a fresh 
process)—although I think the only way you could verify that is by using ctypes 
(or a C extension) to peek into the internal PyUnicode structs and look at the 
appropriate pointers. Anyway, the access pattern should more than be simple 
enough for the CPU to effectively cache-prefetch at all three levels, which 
could be a significant speedup for the sorted-list algorithms over the set 
algorithms.

And this isn’t entirely an artificial scenario; sometimes large numbers of 
strings really are generated in sorted order and in large batches like 
this—e.g., the filesystem, or a network client, or whatever is handing you data 
in sorted order and you put it right into a list of strings. So, it might be 
worth testing.

But I think the real win for already-sorted data is laziness. With sets, at 
least one of your data sets has to be strict, so you can put it in a set. With 
sorted lists, with a minor rewrite to the algorithms (which I posted earlier 
for intersection), you can use them directly on iterators instead. Imagine 
you’re reading 7GB worth of (already-sorted) strings and comparing to 2GB worth 
of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the 
set, sorted list, or even anything clever that pandas might do, they’ll all 
throw your machine into swap hell, which will probably swamp the cost of using 
step-compare instead of sets and of using Iterators instead of lists. This one 
might be harder to benchmark (because the cost of the Iterator is going to end 
up threaded through the cost of the algorithm), but it seems like it should be 
a huge win over sets.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/3FZCA7GICPYLDQ2RCZGDBBZJZDLMJND2/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to