Andrew Barnert wrote:
> On Sep 20, 2019, at 09:21, Richard Higginbotham higgi...@gmail.com wrote:
> > Richard Musil wrote:
> > On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarn...@yahoo.com wrote:
> > On Sep 19, 2019, at 15:18, Richard Musil risa20...@gmail.com wrote:
> > After some thought I don't think the time comparisons are useful unless
> > we include the set init time. If we already have two sets, would anyone
> > want to compare and convert that into a list? In the use case where we
> > read a list of files, wouldn't we want to put that into a list instead
> > of a set? I would think we would only put it into a list if we knew we
> > would be doing set operations.
>
> And? If I want to get a-b, a&b, and b-a, as I'd need to for a directory
> diff, I'm not going to do it by keeping them in lists and converting to
> sets three times. Just like I'm not going to keep them unsorted and sort
> them three times to do step-compare operations. So the cost of the
> operations themselves, separate from the setup time, clearly does matter.

Well yes, it matters, but if we focus on that then we dismiss the obvious
use case of running a single operation over two lists and then proceeding
to do something reasonable with the results besides garbage collection.
Maybe this will help with what I'm trying to say: I used Richard Musil's
code with the random shuffle to run some more time tests and expanded the
size of the collections further.
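As an aside, the three-way comparison in the quote above can be sketched roughly as follows: build each set once and reuse it for all three operations, rather than converting three times. The function and file names here are illustrative, not code from this thread.

```python
# Sketch only: build each set once, then reuse it for a-b, a&b, and b-a,
# as in the directory-diff example quoted above. Names are made up.
def dir_diff(names_a, names_b):
    set_a, set_b = set(names_a), set(names_b)
    only_a = set_a - set_b   # entries present only on side A
    common = set_a & set_b   # entries present on both sides
    only_b = set_b - set_a   # entries present only on side B
    return only_a, common, only_b

only_a, common, only_b = dir_diff(["a.txt", "b.txt"], ["b.txt", "c.txt"])
```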
For A not in B: trying A 100000, B 500000
best relcomp_set test time (msec): 21.273
best relcomp_set with set init test time (msec): 143.083
best relcomp_set with set init intersect test time (msec): 79.064
best list_relcomp test time (msec): 35.025
best sort test time (msec): 9.160

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 282.933
best relcomp_set with set init test time (msec): 1795.730
best relcomp_set with set init intersect test time (msec): 1292.691
best list_relcomp test time (msec): 352.033
best sort test time (msec): 87.534

The first point shows my algorithm at about the same performance as a raw
set op. So does the second. That's C code versus Python, yet the time for
the whole operation is on another level. While typing I got another run:

For A not in B: trying A 1000000, B 50000000
best relcomp_set test time (msec): 293.465
best relcomp_set with set init test time (msec): 19223.592
best relcomp_set with set init intersect test time (msec): 10237.699
best list_relcomp test time (msec): 995.477
best sort test time (msec): 735.441

This just continues the trend.

> > At that point we would need to compare adding the elements individually
> > to a list versus a set to then compare the set operations with list
> > operations fairly.
>
> Why would you be adding them individually? Surely, at least if you care
> this much about micro-optimization, you're going to be adding them in
> bulk. If you're reading a file full of file names, you'd just do list(f)
> or set(f). If you're reading 10 files full of file names, or getting them
> out of os.walk steps, you're going to call extend or union with each
> batch, not iterate one by one and add them. Of course the actual details
> depend on the actual use case, and I'm not sure why you're trying to come
> up with a microbenchmark that exactly fits some imaginary and ill-defined
> use in the first place.
> Do you then need to go simulating a filesystem's readdir performance
> characteristics in some particular but arbitrary scenario? What's the
> goal here?

I'm not concerned with micro-optimization; I'm concerned with macro
operations on large data sets. Right now, if someone comes to you with
that simple use case of two large lists, you have to tell them to convert
to sets and then back again. Forget the new programmers who tried the
naive way and quit because "python is too slow"; they are waiting 20
seconds when we could deliver the result in 1 second. If I can get within
10x of C code, that seems like it would be useful for that type of use
case. I don't know if my case is common, and I've used it on more esoteric
problems that I wouldn't expect anyone else to have (I'm used to dealing
with a lot of data), but if it can be of help I wanted to make it
available.

> I think it's pretty easy to establish that sets are fast but they're not
> magic, sorting is pretty good but it's not magic, there are some cases
> where step-compare is faster if you have already-sorted lists... and none
> of this matters anyway, because in real life there are very few use cases
> where you can equally choose between sets and step compare and need to
> micro-optimize, but lots of cases where you have _other_ reasons to
> choose between them.

There is no step-compare option in Python now, though. You either convert
to sets or you hope the sun doesn't burn out before your program finishes.

> Sets work on non-comparable values; step-compare works on non-hashable
> values. Step-compare (if implemented right) works on lazy iterables; sets
> can preserve the order of the "master" input. They do different things
> with duplicates. Even when performance is the only issue (and it's not
> dominated by the file system anyway, as it would be in your case), the
> characteristics of your specific data are going to make a huge
> difference.
> Not to mention that there are often other compelling reasons to have a
> set or a sorted list, at which point the choice is made for you. So, this
> already eliminates the possibility of reimplementing set operations with
> step-compare as suggested earlier. But it also already makes a compelling
> case for adding step-compare functions to Python—but that case needs to
> be made clearly.

I'm trying to make that case. For me it's made a big difference, but my
use case is solved. I don't need it added; I just thought it might help
others. Chances are, if someone is using large data sets where it would be
useful, they are using a flavor of C, Fortran, or maybe numpy. If no one
here thought it would be useful, I planned to check with the numpy folks
next.

> I think they'd fit very nicely in itertools, because they already provide
> an algebra on iterables, including functions like groupby that expect to
> be handed pre-sorted iterables, but maybe not everyone agrees. I don't
> know whether their usefulness (and stability) is so obvious that they
> don't need to go on PyPI to prove themselves. I also don't know whether
> they already exist on PyPI (maybe as part of more-itertools and/or
> toolz?). Also, I'm pretty sure there are questions on StackOverflow and
> elsewhere from people trying to do step-compare and getting it wrong,
> which could be another argument for having it in the stdlib, or at least
> the recipes in the docs, but only if someone digs up the questions. And
> there can be bikeshedding discussions about exactly which operations are
> needed, and what to call them. But further artificial micro benchmarks
> aren't going to help resolve any of those questions.

I've never come across another step-and-compare implementation. I've
looked quite a bit in the past, but not since I developed my own. It seems
numpy might have something, but I haven't looked into it yet. There are
probably some Fortran libraries.
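For the itertools-style fit discussed above, one possible shape (a sketch with an invented name, not code from this thread and not the code that was benchmarked) is a generator that walks two already-sorted iterables with the step-compare technique and yields the difference lazily, in O(len(a) + len(b)):

```python
# Hypothetical sketch of a step-compare difference for *sorted* inputs.
# The name iter_difference is made up here; this is not an existing API.
def iter_difference(a, b):
    """Yield elements of sorted iterable a that do not appear in sorted b."""
    b_iter = iter(b)
    sentinel = object()
    cur = next(b_iter, sentinel)
    for x in a:
        # Step b forward until it catches up with x.
        while cur is not sentinel and cur < x:
            cur = next(b_iter, sentinel)
        # x belongs to the difference iff b is exhausted or has passed x.
        if cur is sentinel or x < cur:
            yield x

list(iter_difference([1, 2, 4, 6], [2, 3, 6]))  # -> [1, 4]
```

Like itertools.groupby, this is only correct when both inputs are pre-sorted; unlike the set idiom, it preserves the order of the first input, never materializes either iterable, and needs only comparable (not hashable) elements.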
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/Y26FKW2FRWQNFHQTY5UTGI3NLZYPRZGV/
Code of Conduct: http://python.org/psf/codeofconduct/