On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham <higgi...@gmail.com>
wrote:

> It's written in Python. It used to require containers with 100k or more
> elements to see a speed improvement over some other unnamed methods. These
> days it's closer to 50 million, with the only other contender being set
> (hash based) operations, at least that I know of. I think there is a
> reasonable chance that if it were converted to C it would be comparable
> to, if not better than, the set operations for people who wanted lists for
> whatever reason, even with much smaller data sets. One issue is that it is
> difficult to time test because it has side effects and because, well,
> computers. It's also difficult to know what users would put into a list,
> so that we can determine its usefulness, either in relation to set
> operations or not. I'm used to thinking in lists, so I don't have good
> judgement on how useful it would be to other people.
>

You are right, it is definitely not easy to benchmark, which I realized
after I digested what Tim and Andrew wrote about the string representation,
hashing and caching:
1) There are at least two levels of separate meta-data constructs, before
reaching the actual "const char*" for a string in a string list.
2) The hashes on strings are cached.
3) Sequentially created objects most likely end up in consecutive memory;
when you shuffle them, everything gets messed up.
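
To make points 1) and 2) concrete, here is a tiny illustration of my own
(it leans on CPython implementation details):

import sys

s = "deadbeef"
# 1) the str object carries a header (refcount, type pointer, length,
#    cached hash, flags) on top of the raw character data:
print(sys.getsizeof(s))  # several dozen bytes for an 8-character string
# 2) the hash is computed on the first call and cached in the object,
#    so repeated hash(s) calls do not re-scan the characters:
assert hash(s) == hash(s)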

So I made a series of tests to prove some of my theories. I am using timeit
(because I believe it is the best I can do), but I am trying to use it while
taking into account the points mentioned above. The one consequence is that
I run all timeit tests with only one iteration, to avoid side effects from
cached hashing (I will demonstrate this on an example). I chose relatively
large datasets (10M-element string lists) where the execution time should
be long enough to justify the one-iteration rule (my machine is not
stressed at all at the time I run the tests). I chose the set difference
operation (A - B) as the benchmark, which your code implements in the
`list_relcomp` function (I got the latest version of `listex_timit.py` from
your repo).
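
For readers following along without the repo: as I understand it, the core
of `list_relcomp` is a sort-then-merge "step compare". The sketch below is
mine, not your actual code, and only shows the general technique:

def step_compare_diff(a, b):
    # compute A - B by sorting both lists and walking them in lockstep;
    # O(n log n) for the sorts plus one linear pass
    a = sorted(a)
    b = sorted(b)
    out = []
    i = j = 0
    while i < len(a):
        if j >= len(b) or a[i] < b[j]:
            out.append(a[i])  # a[i] is smaller than anything left in b
            i += 1
        elif a[i] > b[j]:
            j += 1            # advance b until it catches up with a[i]
        else:
            i += 1            # a[i] also occurs in b, drop it
    return out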

First, let's set up the environment:

In [2]: from os import urandom
   ...: from random import shuffle
   ...: from timeit import timeit
   ...: from listex_timit import list_relcomp

Now let's see the impact of the string hash caching:

In [3]: a = [urandom(8).hex() for i in range(10000000)]
In [4]: timeit('set(a)', number=1, globals=locals())
Out[4]: 1.5521358999999961
In [5]: timeit('set(a)', number=1, globals=locals())
Out[5]: 0.9837615

What happens here is that I create a list of random strings; the first
timeit measures the set operation where all the strings have to be hashed,
while the consecutive one measures only the build time of the set itself,
since the hashing has already been done. The important point is that if I
ran the benchmark over several iterations, from the second iteration on I
would only be benchmarking the set construction, not the hashing, and with
as few as 10 iterations the hashing time would basically be eliminated from
the results. This may or may not correspond to the real use case scenario,
so I am not saying it is a priori wrong, just that you have to pay
attention to what you are benchmarking, and whether you are benchmarking
what you think you are benchmarking.
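
If you do want multiple repetitions without the hash caching bleeding from
one into the next, one option (my suggestion, not what I did above) is to
rebuild the data in timeit's setup, which runs before each measurement but
outside the timed region:

from timeit import repeat

setup = "from os import urandom; a = [urandom(8).hex() for i in range(10000000)]"
# setup re-creates the strings before every repetition, so each
# measured set(a) has to hash everything from scratch
times = repeat('set(a)', setup=setup, number=1, repeat=3)
print(times)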

The second important point is the memory access pattern and cache
thrashing. Let's see what happens when I shuffle the data:

In [6]: shuffle(a)
In [7]: timeit('set(a)', number=1, globals=locals())
Out[7]: 1.4697333000000015

Now, even with the hashes already cached, the time went up by 50% again
because of cache thrashing. And if I start fresh, without any cached
hashes, and shuffle the data:

In [8]: a = [urandom(8).hex() for i in range(10000000)]
In [9]: shuffle(a)
In [10]: timeit('set(a)', number=1, globals=locals())
Out[10]: 2.033926100000002

Quite a difference from the "best case". Let's see now how cache thrashing
impacts the lists (there is no hash caching involved in list construction,
so the time is invariant across iterations):

In [11]: a = [urandom(8).hex() for i in range(10000000)]
In [14]: timeit('list(a)', number=1, globals=locals())
Out[14]: 0.13943980000001943
In [15]: timeit('list(a)', number=1, globals=locals())
Out[15]: 0.13443960000003585
In [16]: shuffle(a)
In [17]: timeit('list(a)', number=1, globals=locals())
Out[17]: 0.25544789999997874
In [18]: timeit('list(a)', number=1, globals=locals())
Out[18]: 0.2587161000000151
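
The effect is purely about memory layout: the list stores pointers, and a
shuffle makes us visit the same objects in a scattered address order. A
crude way to see this (my illustration, and CPython-specific, since id()
happens to be the object's address there):

from os import urandom
from random import shuffle

a = [urandom(8).hex() for i in range(1000)]
base = id(a[0])
print([id(x) - base for x in a[:5]])  # typically small, growing offsets
shuffle(a)
print([id(x) - base for x in a[:5]])  # same objects, scattered addresses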

With the combined knowledge from the previous tests, I can now run the
actual benchmark for the 'A - B' set difference using the set operations
and explain the results (I am converting the result back to a list to make
it equivalent to your `list_relcomp`):

In [19]: a = [urandom(8).hex() for i in range(10000000)]
In [20]: b = [urandom(8).hex() for i in range(10000000)]
In [21]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[21]: 3.8052587000000244
In [22]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[22]: 2.6856725000000097

We can clearly see the impact of the hash caching on the time (I do not
include the subsequent timeit executions, but from the 2nd iteration on
they were giving basically the same time).
Your function runs like this:

In [23]: a = [urandom(8).hex() for i in range(10000000)]
In [24]: b = [urandom(8).hex() for i in range(10000000)]
In [25]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[25]: 18.45155139999997
In [26]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[26]: 7.87324250000006
In [27]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[27]: 7.88726209999993

Let's say the first times are comparable, as there was no data prep. The
second times are also comparable: the set ops have cached the hashes, and
you sorted your lists in place (which gives the significant improvement in
the subsequent runs). This may also skew your benchmark results
significantly if you run them in iterations, though (if the real use case
does not preserve the sorting).
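
One way around that side effect (again just a suggestion) is to hand
`list_relcomp` throw-away copies built in timeit's setup, so every
repetition starts from the original, unsorted data and the copying stays
outside the timed region:

from timeit import repeat

# aa and bb are fresh copies rebuilt before each repetition;
# list_relcomp can sort them in place without touching a and b
times = repeat('list_relcomp(aa, bb, True, True)',
               setup='aa = list(a); bb = list(b)',
               number=1, repeat=3, globals=globals())
print(times)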

Finally, I also ran the same exercise with the data shuffled, which is
possibly the worst case scenario. First the set ops:

In [28]: a = [urandom(8).hex() for i in range(10000000)]
In [29]: b = [urandom(8).hex() for i in range(10000000)]
In [30]: shuffle(a)
In [31]: shuffle(b)
In [32]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[32]: 4.671694499999944
In [33]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[33]: 3.6129220999999916
In [34]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[34]: 3.424502699999948
In [35]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[35]: 3.4897670999999946

then your `list_relcomp` implementation:

In [36]: a = [urandom(8).hex() for i in range(10000000)]
In [37]: b = [urandom(8).hex() for i in range(10000000)]
In [38]: shuffle(b)
In [39]: shuffle(a)
In [40]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[40]: 24.619852700000024
In [41]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[41]: 7.913995600000021
In [42]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[42]: 8.124123400000371
In [43]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[43]: 8.000760899999932

I wrote all this to show that without insight it can sometimes be difficult
or impossible to do it right (I was caught in several pitfalls myself in my
original benchmarks posted here), and also because it was actually fun to
learn quite a few things from this thread. I am, however, not convinced
that step-compare lists can compete with set operations (even when done on
lists) at the current state of things.

Richard M.