Andrew Barnert wrote:

> On Sep 23, 2019, at 15:32, Richard Higginbotham higgi...@gmail.com wrote:
>
> > > Considering your use case however, I wonder if you would not be
> > > better going with the iterator approach (as Andrew has been hinting
> > > already for some time in his posts).
> > >
> > > Richard M.
> >
> > They will most likely have good performance on data sets that are
> > close, but very poor on those that aren't balanced.
>
> The code you posted doesn't do any galloping, it just advances the index
> one at a time. The fact that you could do approximate binary searches to
> optimize it doesn't help if you never do those searches. So it'll be
> exactly as poor on unbalanced inputs as my iterator code.

Yes, it probably will, but we won't be able to add binary searches using
iterators. In which case, why use iterators... It's a pretty basic
optimization step.
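To be concrete, the optimization I mean is something like this (an
illustrative sketch, not my posted code; the name and details are invented
here): with lists you can bisect into the remaining slice of B instead of
stepping one index at a time.

    from bisect import bisect_left

    def list_relcomp_gallop(a, b):
        # Relative complement "a - b" for two sorted lists. Instead of
        # stepping through b one element at a time, jump j forward with a
        # binary search over the remaining slice b[j:n].
        out = []
        j, n = 0, len(b)
        for x in a:
            # Find the first position in b[j:n] holding a value >= x.
            j = bisect_left(b, x, j, n)
            if j == n or b[j] != x:
                out.append(x)  # x does not occur in b
        return out

That's a two-line change when you have indices; with a pure iterator there
is nothing to bisect into.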
> After all, they're exactly the same algorithm (except for trivial
> differences in how they handle one input or the other ending early); the
> only difference is that one implementation calls next and remembers the
> value, the other increases an index and indexes a list.

I'm not bragging about my code. I think it's neat, but I wrote it 20 years
ago, after graduating from college with a chemical engineering degree. It
was my second Python program, and something like my third or fourth program
outside of re-implementing the C library (as is common with students).

> Of course if you always have lists already in memory, the added overhead
> of using an iterator over the list is cost for no benefit. But that's a
> fixed multiplier, it's not different for balanced vs. unbalanced inputs.
> And of course if the iterator ever allows you to avoid building an
> otherwise-unnecessary list, it's a win, but that win is also a flat cost
> that depends only on the allocation time for N elements, which also
> doesn't vary based on balanced vs. unbalanced inputs.
>
> Anyway, if your use case really is a gigantic directory from a filesystem
> that guarantees sorted order and it supports iterdir, iterators should
> help quite a bit.

In the spirit of camaraderie, I'll try to follow Steven D'Aprano's
community advice and be a little more blunt. I'll skip the insults, though;
I wrote them, but I took them out. Honestly, I could have done this at any
time, I just didn't want to hurt people's feelings.

I already converted it to C a long time ago and did many, many tests I
don't want to repeat. I know how it scales. We don't even have to convert
to C; let's just use PyPy. For grins we'll change to an integer hash.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.130
best relcomp_set with set init intersect test time (msec): 0.115
best list_relcomp test time (msec): 0.012
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.117
best relcomp_set with set init test time (msec): 2.448
best relcomp_set with set init intersect test time (msec): 2.649
best list_relcomp test time (msec): 0.077
best sort test time (msec): 0.048

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Huh? At about 1000 elements they are equal, but after that my second Python
program is 10x faster. Oops.
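For anyone who wants to reproduce the shape of these numbers, the harness
is roughly the following. This is a reconstruction, not my actual script:
the function bodies are my best guess at what the labels above measure, and
the "set init" row presumably also pays for building the set from the list.

    import random
    import timeit

    def relcomp_set(a, bset):
        # "relcomp_set": membership tests against a prebuilt set.
        return [x for x in a if x not in bset]

    def list_relcomp(a, b):
        # "list_relcomp": one merge-style pass over two sorted lists.
        out = []
        j, n = 0, len(b)
        for x in a:
            while j < n and b[j] < x:
                j += 1
            if j == n or b[j] != x:
                out.append(x)
        return out

    for na, nb in [(1000, 5000), (10000, 50000), (1000000, 5000000)]:
        a = sorted(random.sample(range(10 * nb), na))
        b = sorted(random.sample(range(10 * nb), nb))
        bset = set(b)
        print("For A not in B: trying A %d, B %d" % (na, nb))
        for label, fn in [
            ("relcomp_set", lambda: relcomp_set(a, bset)),
            ("relcomp_set with set init", lambda: relcomp_set(a, set(b))),
            ("list_relcomp", lambda: list_relcomp(a, b)),
        ]:
            best = min(timeit.repeat(fn, number=1, repeat=5))
            print("best %s test time (msec): %.3f" % (label, best * 1000.0))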
Let's try more iterations "with a real benchmark program". I won't repeat
mine, though; we'll just take the first number that comes by.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.116
best relcomp_set with set init intersect test time (msec): 0.118
best list_relcomp test time (msec): 0.015
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.115
best relcomp_set with set init test time (msec): 2.729
best relcomp_set with set init intersect test time (msec): 2.780
best list_relcomp test time (msec): 0.111
best sort test time (msec): 0.049

You don't want to see data sets larger than that. OK, a little sarcasm, but
you earned it. All you had to do was not be jerks.