Let me go back to the top here.

On Sep 18, 2019, at 12:29, Richard Higginbotham <higgi...@gmail.com> wrote:
> 
> I have frequently come across cases where I would like to compare items in 
> one list in another similar to relational algebra.

I’ve put together an itertools-style implementation at 
https://github.com/abarnert/iterset if anyone’s interested. It includes merge, 
union, intersection, difference, symmetric_difference, issubset, and 
issuperset, all taking two sorted iterables and a key function.
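For anyone who doesn’t want to click through, the core idea is just a classic two-pointer merge over two already-sorted iterables. Here’s a minimal sketch of the intersection case (illustrative only, not the exact iterset code; the name and signature are just for this example):

    def intersection(a, b, key=lambda x: x):
        # Yield items of sorted iterable a whose keys also appear in sorted
        # iterable b: walk b forward only as far as needed for each item of a.
        ib = iter(b)
        sentinel = object()
        y = next(ib, sentinel)
        for x in a:
            kx = key(x)
            while y is not sentinel and key(y) < kx:
                y = next(ib, sentinel)
            if y is sentinel:
                return
            if key(y) == kx:
                yield x

    list(intersection([1, 2, 4, 7, 9], [2, 3, 7, 10]))  # -> [2, 7]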

While all of these things (except merge) can be done by just putting one 
iterable into a set (or Counter, if you need dups) and passing the other to a 
method, there are good reasons to prefer these functions in some cases: 
preserving input order, taking key functions, working on non-hashable values, 
chaining up itertools-style with other iterator transforms, etc.
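For comparison, that set/Counter route looks like this (plain stdlib, nothing iterset-specific); it’s simple and fast, but it needs hashable values, loses input order, and gives you nowhere to hang a key function:

    from collections import Counter

    a = [3, 1, 4, 1, 5, 9, 2, 6]
    b = [9, 7, 1, 3, 1]

    common = set(a) & set(b)            # fast, but input order and dups are gone,
                                        # and every value has to be hashable
    with_dups = Counter(a) & Counter(b) # multiset intersection; .elements()
                                        # gets the items back, dups included

    print(sorted(common), sorted(with_dups.elements()))
    # [1, 3, 9] [1, 1, 3, 9]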

Notice that the C++ standard library has exactly the same set of functions 
(std::merge, std::set_union, std::set_intersection, std::set_difference, 
std::set_symmetric_difference, std::includes) at the same level of flexibility: 
they take iterator ranges and an optional less-than function. It provides these 
despite also having both unordered (hash) and ordered (tree) set containers, for 
much the same reasons. Also notice that itertools already has at least one 
function that expects to work on sorted iterables, groupby, and that in general 
itertools is intended to provide an “iterator algebra” that this fits into 
nicely.
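(To illustrate the groupby precedent: it only groups consecutive runs of equal keys, so just like these functions it assumes input already sorted by the key.)

    from itertools import groupby

    words = sorted(["apple", "bee", "ant", "cat", "bat"], key=lambda w: w[0])
    for letter, group in groupby(words, key=lambda w: w[0]):
        print(letter, list(group))
    # a ['apple', 'ant']
    # b ['bee', 'bat']
    # c ['cat']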

So, I think they’re useful enough to exist somewhere, possibly in itertools, 
possibly in a third-party library (although maybe as part of more-itertools 
and/or toolz rather than on their own). 

It’s barely tested, not at all optimized, not fully documented, and generally 
not ready to turn into a patch to the stdlib, more-itertools, or toolz, or into 
a PyPI package. But if anyone wants it polished up to that level, let me know 
and I could probably find time this week. (If not, I’ll just keep it in my 
toolbox and maybe come back to it next time I need it.)

I haven’t tested performance at all. I expect it to be slower (when used with 
data that’s already sorted and already in a list) than Richard Higginbotham’s 
list implementation, by a potentially hefty multiplier. From a quick&dirty test, 
I think the multiplier can be brought way down (albeit at significant cost to 
readability), but it’ll still be higher than 1. And that’s already slower than 
sets for at least most uses, as Richard Musil has demonstrated. Of course for
huge data where reading everything into memory means swap hell, it’ll be a 
whole lot faster, but if you really need that, I’ll bet pandas/HDF5 has even 
better alternatives. Interleaving the work with the cost of reading or 
constructing the values might help performance in some cases, but that’s hard 
to generalize about. Overall, you probably want them when you need to preserve 
input order, or use a key function, etc., not when sets aren’t fast enough.
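If anyone does want to measure, a rough harness would look something like this (the sorted_intersection here is just the illustrative two-pointer sketch from above, inlined so the snippet stands alone; I haven’t run this as a benchmark, so no numbers attached):

    import random
    import timeit

    def sorted_intersection(a, b):
        # Same two-pointer merge as the earlier sketch, without the key function.
        ib = iter(b)
        sentinel = object()
        y = next(ib, sentinel)
        for x in a:
            while y is not sentinel and y < x:
                y = next(ib, sentinel)
            if y is sentinel:
                return
            if y == x:
                yield x

    a = sorted(random.sample(range(1_000_000), 50_000))
    b = sorted(random.sample(range(1_000_000), 50_000))

    print("set-based :", timeit.timeit(lambda: set(a) & set(b), number=20))
    print("two-pointer:", timeit.timeit(lambda: list(sorted_intersection(a, b)), number=20))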