On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham <higgi...@gmail.com> wrote:
> It's written in Python. It used to require containers with 100k or more
> elements to see a speed improvement over some other unnamed methods. These
> days its closer to 50 million with the only other contender being set (hash
> based) operations at least that I know of. I think there is a reasonable
> chance that if it were converted to C it would be comparable to if not
> better than the Set operations for people who wanted lists for what ever
> reason, even with much smaller data sets. One issue is that it is difficult
> to time test because it has side effects and because well computers. It's
> also difficult to know what users would put into a list so that we
> determine it's usefulness, either in relation to set operations or not. I'm
> used to thinking in lists so I don't have good judgement on how useful it
> would be to other people.

You are right, it is definitely not easy to benchmark, which I realized after I digested what Tim and Andrew wrote about the string representation, hashing and caching:

1) There are at least two levels of separate metadata constructs before you reach the actual "const char*" of a string in a string list.
2) The hashes of strings are cached.
3) Sequentially created objects most likely end up in consecutive memory; when you shuffle them, everything gets messed up.

So I made a series of tests to verify some of my theories. I am using timeit (because I believe it is the best tool I have), while taking the points above into account. One consequence is that I run every timeit test with only one iteration, to avoid side effects from hash caching (I will demonstrate this with an example below). I chose relatively large datasets (lists of 10M strings) where the execution time is long enough to justify the one-iteration rule (my machine is not stressed at all at the time I run the tests).
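As an aside, the one-iteration rule can also be enforced mechanically: let timeit's setup string rebuild the data before every measurement, so each run starts from fresh string objects with no cached hashes. A minimal sketch (scaled down from the 10M-element lists used below, so it runs quickly):

```python
# Sketch: use timeit.repeat with number=1 so that the setup string is
# re-executed before every single measurement. Each run of 'set(a)' then
# pays the full string-hashing cost instead of reusing cached hashes.
from timeit import repeat

setup = """
from os import urandom
a = [urandom(8).hex() for i in range(100_000)]
"""

# Five independent measurements, each on a freshly built list.
times = repeat('set(a)', setup=setup, number=1, repeat=5)
print(min(times))
```

Reporting the minimum of several such cold runs is one common convention; the point is only that setup re-runs for every measurement, so no run benefits from the previous one.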
I chose the set difference operation (A - B) as the benchmark, which your code implements in the `list_relcomp` function (I got the latest version of `listex_timit.py` from your repo). First let's set up the environment:

In [2]: from os import urandom
   ...: from random import shuffle
   ...: from timeit import timeit
   ...: from listex_timit import list_relcomp

Now let's see the impact of string hash caching:

In [3]: a = [urandom(8).hex() for i in range(10000000)]

In [4]: timeit('set(a)', number=1, globals=locals())
Out[4]: 1.5521358999999961

In [5]: timeit('set(a)', number=1, globals=locals())
Out[5]: 0.9837615

What happens here: I create a list of random strings; the first timeit measures the set operation while all the strings still have to be hashed, and the subsequent one measures only the build time of the set itself, since the hashing has already been done. The important point is that if I ran the benchmark over several iterations, every iteration after the first would benchmark only the set construction, not the hashing, and with as few as 10 iterations the hashing time would basically disappear from the results. This may or may not correspond to the real use case scenario, so I am not saying it is a priori wrong, just that one has to pay attention to what one is benchmarking, and whether it really is what one thinks it is.

The second important point is the memory access pattern and cache thrashing. Let's see what happens when I shuffle the data:

In [6]: shuffle(a)

In [7]: timeit('set(a)', number=1, globals=locals())
Out[7]: 1.4697333000000015

Even with the hashes already cached, the time went up by 50% again because of cache thrashing. And if I start fresh, without any cached hashes, on shuffled data:

In [8]: a = [urandom(8).hex() for i in range(10000000)]

In [9]: shuffle(a)

In [10]: timeit('set(a)', number=1, globals=locals())
Out[10]: 2.033926100000002

Quite a difference from the best case.
Let's see now how cache thrashing impacts lists (there is no hash caching involved in list ops, so the time is invariant across iterations):

In [11]: a = [urandom(8).hex() for i in range(10000000)]

In [14]: timeit('list(a)', number=1, globals=locals())
Out[14]: 0.13943980000001943

In [15]: timeit('list(a)', number=1, globals=locals())
Out[15]: 0.13443960000003585

In [16]: shuffle(a)

In [17]: timeit('list(a)', number=1, globals=locals())
Out[17]: 0.25544789999997874

In [18]: timeit('list(a)', number=1, globals=locals())
Out[18]: 0.2587161000000151

With the combined knowledge from the previous tests, I can now run the actual benchmark for the A - B set difference using set operations and explain the results (I convert the result back to a list to make it equivalent to your `list_relcomp`):

In [19]: a = [urandom(8).hex() for i in range(10000000)]

In [20]: b = [urandom(8).hex() for i in range(10000000)]

In [21]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[21]: 3.8052587000000244

In [22]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[22]: 2.6856725000000097

We can clearly see the impact of hash caching on the time (I do not include the subsequent timeit runs, but from the 2nd iteration on they gave basically the same time).
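For readers who do not have `listex_timit.py` at hand: the following is my own minimal sketch of a sorted-merge ("step-compare") difference, the general kind of algorithm `list_relcomp` implements, not Richard's actual code. Sort both lists, then walk them in lockstep:

```python
def stepcompare_diff(a, b):
    """Sketch of a step-compare difference A - B on lists.

    Sorts both inputs in place (analogous to list_relcomp's optional
    sorting flags), then advances two indices in lockstep, keeping the
    items of a that never match an item of b. Cost is roughly
    O(n log n) for the sorts plus a single O(n) merge pass.
    """
    a.sort()
    b.sort()
    result = []
    i = j = 0
    while i < len(a):
        # Skip b entries smaller than the current a entry.
        while j < len(b) and b[j] < a[i]:
            j += 1
        if j >= len(b) or a[i] != b[j]:
            result.append(a[i])  # a[i] does not occur in b
        i += 1
    return result

print(stepcompare_diff(["d", "a", "c"], ["c", "x"]))  # → ['a', 'd']
```

The comparison-based walk never hashes anything, which is why its relative cost profile differs so much from the set-based difference benchmarked above.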
On your function it runs like this:

In [23]: a = [urandom(8).hex() for i in range(10000000)]

In [24]: b = [urandom(8).hex() for i in range(10000000)]

In [25]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[25]: 18.45155139999997

In [26]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[26]: 7.87324250000006

In [27]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[27]: 7.88726209999993

Let's say the first times are comparable, as there was no data prep in either case. The second times are also comparable: the set ops have cached the hashes, and you sorted your lists in place, which gives the significant improvement in the subsequent runs. Note that the in-place sorting may also skew your benchmark results significantly if you run over several iterations (if the real use case does not preserve the sorting).

Finally, the same exercise with the data shuffled, which is possibly the worst-case scenario. First the set ops:

In [28]: a = [urandom(8).hex() for i in range(10000000)]

In [29]: b = [urandom(8).hex() for i in range(10000000)]

In [30]: shuffle(a)

In [31]: shuffle(b)

In [32]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[32]: 4.671694499999944

In [33]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[33]: 3.6129220999999916

In [34]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[34]: 3.424502699999948

In [35]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[35]: 3.4897670999999946

then your `list_relcomp` implementation:

In [36]: a = [urandom(8).hex() for i in range(10000000)]

In [37]: b = [urandom(8).hex() for i in range(10000000)]

In [38]: shuffle(b)

In [39]: shuffle(a)

In [40]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[40]: 24.619852700000024

In [41]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[41]: 7.913995600000021

In [42]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[42]: 8.124123400000371

In [43]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[43]: 8.000760899999932

I wrote all this to show that without insight it can sometimes be difficult or impossible to benchmark correctly (I was caught in several pitfalls myself in the original benchmarks I posted here), and also because it was actually fun to learn quite a few things from this thread. I am, however, not convinced that step-compare on lists can compete with set operations (even when the result is converted back to a list) at the current state of things.

Richard M.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JBRAH7V5TDTEHDFX5MIE2RICPRLBQVV5/
Code of Conduct: http://python.org/psf/codeofconduct/