David Su added the comment:
Here are some tests that I ran:
Using current naive viewkeys intersection:
$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)"
"d.viewkeys() & {0}"
100 loops, best of 3: 447 msec per loop
Nearly half a second per iteration is unacceptably slow.
Solution 1 from my previous comment:
$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0); s =
set(d)" "s & {0}"
100 loops, best of 3: 0.391 usec per loop
This solution caches the keys of the dict in a set, then performs all
intersections on that set.
The initial creation of the set is slow, and it wastes a lot of memory, but
each iteration is much faster.
The workaround that I ended up using:
$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "{item
for item in {0} if item in d}"
100 loops, best of 3: 0.629 usec per loop
This solution just looks up each value in the set to see if it is also in the
dict using a set comprehension.
This avoids wasting time and space on creating a set. For my use case, memory
was a bigger issue as my dictionary can approach tens of GBs in size, and
almost doubling the memory consumption from creating a cache set was
unacceptable.
Of course, in the set comprehension, the smaller of the dict/set should be
iterated over to satisfy the O(min(len(d), len(s))) condition as specified in
https://wiki.python.org/moin/TimeComplexity.
A implementation would look something like this:
def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set):
if len(bigger_dict_or_set) < len(smaller_dict_or_set):
bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set,
bigger_dict_or_set
return {item for item in smaller_dict_or_set if item in bigger_dict_or_set}
In conclusion, porting over the set intersection logic to viewkeys would be the
ideal solution. It avoids wasting memory like in the set cache solution, and I
am sure the C implementation will be much faster than my workaround of manually
performing an intersection using set comprehensions.
My view is that dict.viewkeys() & set(...) should be as performant as the
syntax suggests.
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue27575>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com