Richard Musil wrote:

> On Mon, Sep 23, 2019 at 12:21 AM Richard Higginbotham higgi...@gmail.com wrote:
>
> > I really appreciate the time and thought you have put into it (and others here as well), and it's been educational / fun for me too. One of the concerns I have about using timeit is that it puts a lot of focus on the exact statements that are being executed and all the internals underneath. Looking at "list(set(a).difference(b))", I suspect what I am seeing is 3 C calls from Python. Anything after that is a comparison between C (here) and Python (in list_relcomp). I use Python for prototyping a lot, so I tend to look for those situations more.
>
> I believe that your "suspicion" is correct and the set variant is executed mostly in C code (though I do not have any intimate knowledge of the Python implementation, so I may be wrong), while your code is executed in bytecode, and I also believe no one here claimed otherwise. This however is not a problem of timeit, whose main feature is that it is designed for benchmarking and is available in the standard library. So anyone can use it,

I see what you are saying, but the reason I put a Python function in there that then called the set functions was so that the results wouldn't be so heavily skewed in one direction or the other. I also used the simplest hashable values I could think of (short strings and integers). Testing the run-time divergence of algorithms is probably more art than science, but the main point is that if there is a divergence, either between one algorithm and another or between an algorithm and its theoretical big O, there is a good chance it is significant. The actual execution times themselves matter less than how they change with increasing data sizes.

Beyond that, benchmarks don't really matter that much. There's a lot of emphasis on finding the minimal benchmark time, but it really doesn't matter; I think the discussion we have had shows that. You can benchmark the wrong thing, or benchmark on the wrong equipment, just as you can go wrong by substituting theory for careful benchmarking. I think a good argument could be made for using the worst benchmark result over the best, or perhaps the worst on the equipment where such an algorithm will actually run. That is probably more true for dealing with large data sets than for Python in general; I'm reasonable enough to see that. You can always set a limit past which benchmarks don't matter. As long as you aren't Gates talking about 386s, at least, that is something easily understandable by most people. If someone had said that 1 million elements was the limit for any set / list comparisons, it would be difficult to argue with that. I wouldn't want to.
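To make that concrete, here is a minimal sketch of the kind of measurement I mean (the sizes and data below are placeholders, not the benchmark that was actually run): the set calls stay wrapped in an ordinary Python function, the cost of building the set stays inside the timed call, and the number to watch is how the per-element time moves as n grows, not the raw timings.

    import timeit

    def set_diff(a, b):
        # the three steps that run mostly in C: set(), .difference(), list()
        return list(set(a).difference(b))

    # placeholder data: short strings, in the same spirit as the earlier tests
    for n in (1_000, 10_000, 100_000):
        a = [str(i) for i in range(n)]
        b = [str(i) for i in range(0, n, 2)]
        t = timeit.timeit(lambda: set_diff(a, b), number=20)
        # what matters is how the per-element figure changes with n
        print(f"n={n:>7}  total={t:.4f}s  per-element={t / (n * 20):.3e}s")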
The problem I have with timeit is actually just side effects. The same issue comes into play with sets if you are prebuilding them. If I use timeit to run my algorithm 10 times, the first execution takes the most time and it sorts the lists; later executions are faster, but I don't want those in the results. I'm looking for average-to-worst-case results. In the case of comparing C code with bytecode, we've already given set comparisons a huge boost by choosing simple hashable items and a C implementation. Trying to optimize it even further by using the least amount of statements possible starts to just be silly. I only said my algorithm crossed a threshold of speed at some number of elements. That should be expected of every algorithm. It's not an indictment of the set library or Python. Lots of perfectly good algorithms don't scale well; mine obviously doesn't scale small very well.

> and the results are (somewhat) comparable. If you design your own benchmarking tool (I do not think datetime is a good candidate though), anything you gain you lose on reproducibility and comparability. Besides, saying this benchmark does not give the results I expect would probably also require explaining what you expect and why.
>
> > When I first wrote the algorithms (2000 or so) I compared them to [x for x in a if x in b] and I found that the latter was faster. It didn't make sense at first, but it was because the 'x in b' is done in C (I had just graduated and started using Python). If the data set was large enough I could see the big O growth was greater for the C version versus the Python version. Similar to what we see here in the difference between the set operations and listex. I implemented listex in C and it was much faster at all data sizes over maybe 10 elements. I don't expect the same speedup compared to sets because they are more efficient. I also can't guarantee that the trend of set operations having a higher time growth will hold true for all data sizes. I only point it out and say that for some use cases set relation operations aren't optimal, but they may be good enough.
>
> So far the benchmarks showed that the list implementation was slower than the sets in every non-trivial case, sometimes by quite a margin.

All the cases are trivial: small strings and some integers. When I run them there are differences. Even when you run them there are. Then you strip out the setup time so that the set functions are faster. Then I say that if you keep going, the set gets beaten. Tim has already explained why; it's also what makes timsort so fast.

> I am not saying that the C implementation cannot be faster, in fact, I will probably be eager to test it, if you come up with one. But after what I read (and posted) in this thread I am also reluctant to accept that a C implementation _must_ be faster (or must be significantly faster). Maybe your code, even when run in Python, is well optimized already and won't get a significant speedup, or maybe it will be 5x faster, but, honestly, I do not dare even to guess. The only way to prove it is to actually implement it in C.

I've been busy lately, but I should have something soon.

> Considering your use case, however, I wonder if you would not be better off going with the iterator approach (as Andrew has been hinting for some time in his posts).
>
> Richard M.

The iterator approach will most likely have good performance on data sets that are close, but very poor on those that aren't balanced.
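For reference, this is roughly the shape of merge-style comparison over two sorted inputs that I have in mind (a sketch only, not list_relcomp itself; the function name and the assumption of sorted, duplicate-free inputs are mine):

    def sorted_iter_diff(a, b):
        # Lazily yield items of sorted iterable a that are not in sorted
        # iterable b. Illustrative stand-in for the merge-style approach
        # discussed here, not the actual list_relcomp implementation.
        it_b = iter(b)
        sentinel = object()
        y = next(it_b, sentinel)
        for x in a:
            # advance b until it catches up with x
            while y is not sentinel and y < x:
                y = next(it_b, sentinel)
            if y is sentinel or x < y:
                yield x
            # if x == y, x is in both, so it is skipped

    # usage: list(sorted_iter_diff([1, 2, 4, 7], [2, 3, 7])) -> [1, 4]

It touches each element of both inputs at most once, which is why it holds up when the inputs are comparable, but it still has to walk the whole of the larger input even when the other one is small.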