Andrew Barnert wrote:
> On Sep 20, 2019, at 09:21, Richard Higginbotham higgi...@gmail.com wrote:
> > Richard Musil wrote:
> > On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarn...@yahoo.com wrote:
> > On Sep 19, 2019, at 15:18, Richard Musil risa20...@gmail.com wrote:
> > After some thought I don't think the time comparisons are useful unless
> > we include the set init time. If we already have two sets, would anyone
> > want to compare and convert that into a list? In the use case where we
> > read a list of files, wouldn't we want to put that into a list instead
> > of a set? I would think we would only put it into a list if we knew we
> > would be doing set operations.
>
> And? If I want to get a-b, a&b, and b-a, as I'd need to for a directory
> diff, I'm not going to do it by keeping them in lists and converting to
> sets three times. Just like I'm not going to keep them unsorted and sort
> them three times to do step-compare operations. So the cost of the
> operations themselves, separate from the setup time, clearly does matter.

Well yes, it matters, but if we focus on that then we dismiss the obvious
use case of running a single operation over two lists and then proceeding
to do something reasonable with the results besides garbage collection.
Maybe this will help with what I'm trying to say: I used Richard Musil's
code with the random shuffle to run some more time tests and expanded the
size of the collections further.
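As an aside, the three-way comparison in the quote above can be sketched roughly as follows: build each set once and reuse it for all three operations, rather than converting three times. The function and file names here are illustrative, not code from this thread.

```python
# Sketch only: build each set once, then reuse it for a-b, a&b, and b-a,
# as in the directory-diff example quoted above. Names are made up.
def dir_diff(names_a, names_b):
    set_a, set_b = set(names_a), set(names_b)
    only_a = set_a - set_b   # entries present only on side A
    common = set_a & set_b   # entries present on both sides
    only_b = set_b - set_a   # entries present only on side B
    return only_a, common, only_b

only_a, common, only_b = dir_diff(["a.txt", "b.txt"], ["b.txt", "c.txt"])
```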
For A not in B: trying A 100000, B 500000
best relcomp_set test time (msec): 21.273
best relcomp_set with set init test time (msec): 143.083
best relcomp_set with set init intersect test time (msec): 79.064
best list_relcomp test time (msec): 35.025
best sort test time (msec): 9.160

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 282.933
best relcomp_set with set init test time (msec): 1795.730
best relcomp_set with set init intersect test time (msec): 1292.691
best list_relcomp test time (msec): 352.033
best sort test time (msec): 87.534

The first point shows my algorithm at about the same performance as a raw
set op. So does the second. That's C code versus Python, yet the time for
the whole operation is on another level. While typing I got another run:

For A not in B: trying A 1000000, B 50000000
best relcomp_set test time (msec): 293.465
best relcomp_set with set init test time (msec): 19223.592
best relcomp_set with set init intersect test time (msec): 10237.699
best list_relcomp test time (msec): 995.477
best sort test time (msec): 735.441

This just continues the trend.

> > At that point we would need to compare adding the elements individually
> > to a list versus a set to then compare the set operations with list
> > operations fairly.
>
> Why would you be adding them individually? Surely, at least if you care
> this much about micro-optimization, you're going to be adding them in
> bulk. If you're reading a file full of file names, you'd just do list(f)
> or set(f). If you're reading 10 files full of file names, or getting them
> out of os.walk steps, you're going to call extend or union with each
> batch, not iterate one by one and add them. Of course the actual details
> depend on the actual use case, and I'm not sure why you're trying to come
> up with a microbenchmark that exactly fits some imaginary and ill-defined
> use in the first place.
> Do you then need to go simulating a filesystem's readdir performance
> characteristics in some particular but arbitrary scenario? What's the
> goal here?

I'm not concerned with micro-optimization; I'm concerned with macro
operations on large data sets. Right now, if someone comes to you with
that simple use case of two large lists, you have to tell them to convert
to sets and then back again. Forget the new programmers who tried the
naive way and quit because "python is too slow"; they are waiting 20
seconds when we could deliver the result in 1 second. If I can get within
10x of C code, that seems like it would be useful for that type of use
case. I don't know if my case is common, and I've used it on more esoteric
problems that I wouldn't expect anyone else to have (I'm used to dealing
with a lot of data), but if it can be of help I wanted to make it
available.

> I think it's pretty easy to establish that sets are fast but they're not
> magic, sorting is pretty good but it's not magic, there are some cases
> where step-compare is faster if you have already-sorted lists... and none
> of this matters anyway, because in real life there are very few use cases
> where you can equally choose between sets and step compare and need to
> micro-optimize, but lots of cases where you have _other_ reasons to
> choose between them.

There is no step-compare option in Python now, though. You either convert
to sets or you hope the sun doesn't burn out before your program finishes.

> Sets work on non-comparable values; step-compare works on non-hashable
> values. Step-compare (if implemented right) works on lazy iterables; sets
> can preserve the order of the "master" input. They do different things
> with duplicates. Even when performance is the only issue (and it's not
> dominated by the file system anyway, as it would be in your case), the
> characteristics of your specific data are going to make a huge
> difference.
> Not to mention that there are often other compelling reasons to have a
> set or a sorted list, at which point the choice is made for you. So, this
> already eliminates the possibility of reimplementing set operations with
> step-compare as suggested earlier. But it also already makes a compelling
> case for adding step-compare functions to Python—but that case needs to
> be made clearly.

I'm trying to make that case. For me it's made a big difference, but my
use case is solved. I don't need it added; I just thought it might help
others. Chances are, if someone is using large data sets where it would be
useful, they are using a flavor of C, Fortran, or maybe numpy. If no one
here thought it would be useful, I planned to check with the numpy folks
next.

> I think they'd fit very nicely in itertools, because they already provide
> an algebra on iterables, including functions like groupby that expect to
> be handed pre-sorted iterables, but maybe not everyone agrees. I don't
> know whether their usefulness (and stability) is so obvious that they
> don't need to go on PyPI to prove themselves. I also don't know whether
> they already exist on PyPI (maybe as part of more-itertools and/or
> toolz?). Also, I'm pretty sure there are questions on StackOverflow and
> elsewhere from people trying to do step-compare and getting it wrong,
> which could be another argument for having it in the stdlib, or at least
> the recipes in the docs, but only if someone digs up the questions. And
> there can be bikeshedding discussions about exactly which operations are
> needed, and what to call them. But further artificial micro benchmarks
> aren't going to help resolve any of those questions.

I've never come across another step-and-compare implementation. I've
looked quite a bit in the past, but not since I developed my own. It seems
numpy might have something, but I haven't looked into it yet. There are
probably some Fortran libraries.
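For the itertools-style fit discussed above, one possible shape (a sketch with an invented name, not code from this thread and not the code that was benchmarked) is a generator that walks two already-sorted iterables with the step-compare technique and yields the difference lazily, in O(len(a) + len(b)):

```python
# Hypothetical sketch of a step-compare difference for *sorted* inputs.
# The name iter_difference is made up here; this is not an existing API.
def iter_difference(a, b):
    """Yield elements of sorted iterable a that do not appear in sorted b."""
    b_iter = iter(b)
    sentinel = object()
    cur = next(b_iter, sentinel)
    for x in a:
        # Step b forward until it catches up with x.
        while cur is not sentinel and cur < x:
            cur = next(b_iter, sentinel)
        # x belongs to the difference iff b is exhausted or has passed x.
        if cur is sentinel or x < cur:
            yield x

list(iter_difference([1, 2, 4, 6], [2, 3, 6]))  # -> [1, 4]
```

Like itertools.groupby, this is only correct when both inputs are pre-sorted; unlike the set idiom, it preserves the order of the first input, never materializes either iterable, and needs only comparable (not hashable) elements.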
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/Y26FKW2FRWQNFHQTY5UTGI3NLZYPRZGV/
Code of Conduct: http://python.org/psf/codeofconduct/