[Python-ideas] Re: Set operations with Lists

Andrew Barnert via Python-ideas Fri, 20 Sep 2019 10:23:49 -0700

On Sep 20, 2019, at 09:21, Richard Higginbotham <[email protected]> wrote:
> 
> Richard Musil wrote:
>>> On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert [email protected] wrote:
>>> On Sep 19, 2019, at 15:18, Richard Musil [email protected] wrote:
> 
> After some thought I don't think the the time comparisons are useful unless 
> we include the set init time. If we already have two sets, would anyone want 
> to compare and convert that into a list? In the use case where we read a list 
> of files, wouldn't we want to put that into a list instead of a set? I would 
> think we would only put it into a list if we knew we would be doing set 
> operations.


And?

If I want to get a-b, a&b, and b-a, as I’d need to for a directory diff, I’m 
not going to do it by keeping them in lists and converting to sets three times.

Just like I’m not going to keep them unsorted and sort them three times to do 
step-compare operations.

So the cost of the operations themselves, separate from the setup time, clearly 
does matter.

> At that point we would need to compare adding the elements individually to a 
> list versus a set to then compare the set operations with list operations 
> fairly. 

Why would you be adding them individually? Surely, at least if you care this 
much about micro-optimization, you’re going to be adding them in bulk. If 
you’re reading a file full of file names, you’d just do list(f) or set(f). If 
you’re reading 10 files full of file names, or getting them out of os.walk 
steps, you’re going to call extend or union with each batch, not iterate one by 
one and add them.

Of course the actual details depend on the actual use case, and I’m not sure 
why you’re trying to come up with a microbenchmark that exactly fits some 
imaginary and ill-defined use in the first place. Do you then need to go 
simulating a filesystem’s readdir performance characteristics in some 
particular but arbitrary scenario?

What’s the goal here?

I think it’s pretty easy to establish that sets are fast but they’re not magic, 
sorting is pretty good but it’s not magic, there are some cases where 
step-compare is faster if you have already-sorted lists… and none of this 
matters anyway, because in real life there are very few use cases where you can 
equally choose between sets and step compare and need to micro-optimize, but 
lots of cases where you have _other_ reasons to choose between them.

Sets work on non-comparable values; step-compare works on non-hashable values. 
Step-compare (if implemented right) works on lazy iterables; sets can preserve 
the order of the “master”input. They do different things with duplicates. Even 
when performance is the only issue (and it’s not dominated by the file system 
anyway, as it would be in your case), the characteristics of your specific data 
are going to make a huge difference. Not to mention that there are often other 
compelling reasons to have a set or a sorted list, at which point the choice is 
made for you.

So, this already eliminates the possibility of reimplementing set operations 
with step-compare as suggested earlier. But it also already makes a compelling 
case for adding step-compare functions to Python—but that case needs to be made 
clearly.

I think they’d fit very nicely in itertools, because they already provide an 
algebra on iterables, including functions like group you that expect to be 
handed pre-sorted iterables, but maybe not everyone agrees. I don’t know 
whether their usefulness (and stability) is so obvious that they don’t need to 
go on PyPI to prove themselves. I also don’t know whether they already exist on 
PyPI (maybe as part of more-itertools and/or toolz?). Also, I’m pretty sure 
there are questions on StackOverflow and elsewhere from people trying to do 
step-compare and getting it wrong, which could be another argument for having 
it in the stdlib, or at least the recipes in the docs, but only if someone digs 
up the questions. And there can be bikeshedding discussions about exactly which 
operations are needed, and what to call them. But further artificial micro 
benchmarks aren’t going to help resolve any of those questions.

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/JMOEUO7KUSD3NKQQE73GTN2ZANGCWYK6/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Set operations with Lists

Reply via email to