On Sep 22, 2019, at 12:50, Richard Musil <risa20...@gmail.com> wrote:
>
>> On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarn...@yahoo.com> wrote:
>
>> One case I think you didn’t test is when the strings are generated in
>> already-sorted order.
> For the sake of completeness I did some benchmarks also with already sorted
> strings.

I’m a bit surprised that it didn’t make a huge difference, but only a bit; it really is hard to predict performance when you get down to this level of nitpicking allocation with this many layers of abstraction in the way.

>> But I think the real win for already-sorted data is laziness. With sets, at
>> least one of your data sets has to be strict, so you can put it in a set.
>> With sorted lists, with a minor rewrite to the algorithms (which I posted
>> earlier for intersection), you can use them directly on iterators instead.
>> Imagine you’re reading 7GB worth of (already-sorted) strings and comparing
>> to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap
>> space. Clearly the set, the sorted list, or even anything clever that pandas
>> might do will all throw your machine into swap hell, which will probably
>> swamp the cost of using step-compare instead of sets and of using iterators
>> instead of lists. This one might be harder to benchmark (because the cost of
>> the iterator is going to end up threaded through the cost of the algorithm),
>> but it seems like it should be a huge win over sets.

> If you move everything to the iterators and maybe even dilute the operations
> over some async I/O it may be pretty effective, but when talking about the
> algorithm alone, I am not sure even the pre-sorted lists have an advantage,
> at least not in this implementation, maybe when implemented in C?

You don’t really have to do anything to move the operations to the iterators. That’s the whole beauty of generator expressions, itertools, etc.: pile on as many iterator transforms as you like, including things like union or difference, and the resulting iterator automatically interleaves the work lazily, on demand. And this works even if the inputs are coming in from some kind of external I/O.
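To make the "use them directly on iterators" point concrete, here is a minimal step-compare intersection over two sorted iterables. This is my own sketch, not the code posted earlier in the thread or the code in the repo below: it consumes each input exactly once and holds only O(1) values in memory, so neither input ever has to be materialized as a set or list.

```python
def sorted_intersection(a, b):
    """Lazily yield the values present in both sorted iterables a and b.

    Assumes each input is sorted and duplicate-free (like a set's
    elements in sorted order). Each input is consumed only once, and
    only the two "current" values are kept in memory, so this works on
    arbitrarily large generators that would never fit in RAM.
    """
    ita, itb = iter(a), iter(b)
    sentinel = object()  # marks input exhaustion without raising StopIteration
    x, y = next(ita, sentinel), next(itb, sentinel)
    while x is not sentinel and y is not sentinel:
        if x < y:
            x = next(ita, sentinel)   # a is behind: step a forward
        elif y < x:
            y = next(itb, sentinel)   # b is behind: step b forward
        else:
            yield x                   # match: emit and step both
            x, y = next(ita, sentinel), next(itb, sentinel)
```

Because it takes and returns plain iterators, it composes with generator expressions and file objects directly, e.g. `sorted_intersection(open("a.txt"), open("b.txt"))` for two files of sorted lines.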
(Making it work with push-model asyncio or similar frameworks does require rewriting things slightly, however.)

I slapped together a complete (but barely tested, so you might want to verify correctness) implementation at https://github.com/abarnert/iterset if you want to benchmark it. It’s not at all optimized (in fact, there will usually be two extra generators in the path to each value, plus the overhead of checking for, and not calling, a nonexistent key function, etc.), so I’m sure it’ll be slowed down by a considerable constant factor for the in-memory case. But if you run it against two generators of strings, compared against the cost of building a set or two lists with the whole thing in memory, I wouldn’t be surprised if it overtakes them before you even get to swap hell or MemoryError (assuming you consume the result iteratively, of course). And it will definitely beat them once you do get to that size.

Of course if you really do need to intersect two massive sorted inputs that aren’t already in memory, I’ll bet there’s some kind of pandas/HDF way to do it that buffers up large chunks and optimizes the hell out of everything and blows my code away.
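As an illustration of the other transforms being described (again my own sketch, not what’s actually in the iterset repo), a lazy union in the same step-compare style merges two sorted streams and yields each shared value once, draining whichever input is left over at the end:

```python
def sorted_union(a, b):
    """Lazily merge two sorted, duplicate-free iterables, yielding
    each distinct value once, in sorted order.

    Like the intersection version, this consumes each input once and
    keeps O(1) values in memory.
    """
    ita, itb = iter(a), iter(b)
    sentinel = object()  # marks input exhaustion
    x, y = next(ita, sentinel), next(itb, sentinel)
    while x is not sentinel and y is not sentinel:
        if x < y:
            yield x
            x = next(ita, sentinel)
        elif y < x:
            yield y
            y = next(itb, sentinel)
        else:
            yield x                   # present in both: emit once
            x, y = next(ita, sentinel), next(itb, sentinel)
    # one input is exhausted; drain the other
    while x is not sentinel:
        yield x
        x = next(ita, sentinel)
    while y is not sentinel:
        yield y
        y = next(itb, sentinel)
```

Difference and symmetric difference fall out of the same three-way comparison loop by changing which branches yield, which is why all of these operations stay lazy end to end when chained together.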
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CYWPPXIS5IETXMJO7ODCT5654OYBQPZ7/
Code of Conduct: http://python.org/psf/codeofconduct/