On Sep 22, 2019, at 12:50, Richard Musil <risa20...@gmail.com> wrote:
>
>> On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarn...@yahoo.com> wrote:
>
>> One case I think you didn’t test is when the strings are generated in
>> already-sorted order.
> For the sake of completeness I did some benchmarks also with already sorted
> strings.

I’m a bit surprised that it didn’t make a huge difference, but only a bit; it really is hard to predict performance when you get down to this level of nitpicking allocation with this many layers of abstraction in the way.

>> But I think the real win for already-sorted data is laziness. With sets, at
>> least one of your data sets has to be strict, so you can put it in a set.
>> With sorted lists, with a minor rewrite to the algorithms (which I posted
>> earlier for intersection), you can use them directly on iterators instead.
>> Imagine you’re reading 7GB worth of (already-sorted) strings and comparing
>> to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap
>> space. Clearly the set, the sorted list, or even anything clever that pandas
>> might do will all throw your machine into swap hell, which will probably
>> swamp the cost of using step-compare instead of sets and of using iterators
>> instead of lists. This one might be harder to benchmark (because the cost of
>> the iterator is going to end up threaded through the cost of the algorithm),
>> but it seems like it should be a huge win over sets.

> If you move everything to the iterators and maybe even dilute the operations
> over some async I/O it may be pretty effective, but when talking about the
> algorithm alone, I am not sure even the pre-sorted lists have an advantage,
> at least not in this implementation, maybe when implemented in C?

You don’t really have to do anything to move the operations to the iterators. That’s the whole beauty of generator expressions, itertools, etc.: pile on as many iterator transforms as you like, including things like union or difference, and the resulting iterator automatically interleaves the work lazily, on demand. And this works even if the inputs are coming in from some kind of external I/O.
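To make the "use them directly on iterators" point concrete, here is a minimal step-compare intersection over two sorted iterables. This is my own sketch, not the code posted earlier in the thread or the code in the repo below: it consumes each input exactly once and holds only O(1) values in memory, so neither input ever has to be materialized as a set or list.

```python
def sorted_intersection(a, b):
    """Lazily yield the values present in both sorted iterables a and b.

    Assumes each input is sorted and duplicate-free (like a set's
    elements in sorted order). Each input is consumed only once, and
    only the two "current" values are kept in memory, so this works on
    arbitrarily large generators that would never fit in RAM.
    """
    ita, itb = iter(a), iter(b)
    sentinel = object()  # marks input exhaustion without raising StopIteration
    x, y = next(ita, sentinel), next(itb, sentinel)
    while x is not sentinel and y is not sentinel:
        if x < y:
            x = next(ita, sentinel)   # a is behind: step a forward
        elif y < x:
            y = next(itb, sentinel)   # b is behind: step b forward
        else:
            yield x                   # match: emit and step both
            x, y = next(ita, sentinel), next(itb, sentinel)
```

Because it takes and returns plain iterators, it composes with generator expressions and file objects directly, e.g. `sorted_intersection(open("a.txt"), open("b.txt"))` for two files of sorted lines.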
(Making it work with push-model asyncio or similar frameworks does require rewriting things slightly, however.)

I slapped together a complete (but barely tested, so you might want to verify correctness) implementation at https://github.com/abarnert/iterset if you want to benchmark it. It’s not at all optimized (in fact, there will usually be two extra generators in the path to each value, plus the overhead of checking for, and not calling, a nonexistent key function, etc.), so I’m sure it’ll be slowed down by a considerable constant factor for the in-memory case. But if you run it against two generators of strings, compared against the cost of building a set or two lists with the whole thing in memory, I wouldn’t be surprised if it overtakes them before you even get to swap hell or MemoryError (assuming you consume the result iteratively, of course). And it will definitely beat them once you do get to that size.

Of course if you really do need to intersect two massive sorted inputs that aren’t already in memory, I’ll bet there’s some kind of pandas/HDF way to do it that buffers up large chunks and optimizes the hell out of everything and blows my code away.
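As an illustration of the other transforms being described (again my own sketch, not what’s actually in the iterset repo), a lazy union in the same step-compare style merges two sorted streams and yields each shared value once, draining whichever input is left over at the end:

```python
def sorted_union(a, b):
    """Lazily merge two sorted, duplicate-free iterables, yielding
    each distinct value once, in sorted order.

    Like the intersection version, this consumes each input once and
    keeps O(1) values in memory.
    """
    ita, itb = iter(a), iter(b)
    sentinel = object()  # marks input exhaustion
    x, y = next(ita, sentinel), next(itb, sentinel)
    while x is not sentinel and y is not sentinel:
        if x < y:
            yield x
            x = next(ita, sentinel)
        elif y < x:
            yield y
            y = next(itb, sentinel)
        else:
            yield x                   # present in both: emit once
            x, y = next(ita, sentinel), next(itb, sentinel)
    # one input is exhausted; drain the other
    while x is not sentinel:
        yield x
        x = next(ita, sentinel)
    while y is not sentinel:
        yield y
        y = next(itb, sentinel)
```

Difference and symmetric difference fall out of the same three-way comparison loop by changing which branches yield, which is why all of these operations stay lazy end to end when chained together.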
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CYWPPXIS5IETXMJO7ODCT5654OYBQPZ7/
Code of Conduct: http://python.org/psf/codeofconduct/