Alternatively...why you should definitely use binary searches:
Python 3.5.2+ (default, Aug 30 2016, 19:08:42)
[GCC 6.2.0 20160822] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in
>>> range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
1.9478233020054176
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
18.001392804995703
I thought this was curious behavior. I created a list of random-looking
strings, then made a sorted copy. I then found that using "in" to see if a
string exists in the sorted list took over 9 times as long!
At first, I thought since both lists are the same size, and the 'in' test is a
linear search, shouldn't they take the same amount of time? Even if there was
some trickery with branch prediction happening, that would have benefited the
sorted list. Then I remembered how lists work in Python. The original list is
going to be mostly contiguous in memory, making the memory cache quite
effective. When I create the sorted copy, I'm creating a list of references to
strings that are all over the place in memory, causing tons of cache misses.
Of course, the best solution was to implement a binary search. That turned the
membership check into a 300-700 microsecond operation.
--
https://mail.python.org/mailman/listinfo/python-list