Scott David Daniels wrote:
> ... Here's a heuristic replacement for my previous frequency code: I've tried to mark where you could fudge numbers if the run time is at all close.
Boy, I cannot let go.  I did a bit of a test on the cost of counting the
discovered samples, and found, after:

    import timeit
    import numpy

    original = numpy.random.uniform(0, 100, (1000, 1000)).astype(int)
    data = original.flatten()
    data.sort()
    part = data[::100]
    t = timeit.Timer('sum(part[:-1]==part[1:])',
                     'from __main__ import part')
    v = timeit.Timer('len(part[part[:-1]==part[1:]])',
                     'from __main__ import part')

I got:

    >>> t.repeat(3, 10)
    [0.58319842326318394, 0.57617574300638807, 0.57831819407238072]
    >>> v.repeat(3, 1000)
    [0.93933027801040225, 0.93704535073584339, 0.94096260837613954]

Note the repeat counts (10 runs vs. 1000): per call, len(part[mask]) is
roughly 60X faster than sum(mask)!  I checked that the two agree:

    >>> sum(part[:-1]==part[1:])
    9393
    >>> len(part[part[:-1]==part[1:]])
    9393

That's an awful lot of matches, so I tried again with higher selectivity:

    data = original.flatten()   # no sorting, so runs are missing
    part = data[::100]

    >>> t.repeat(3, 10)
    [0.58641335700485797, 0.58458854407490435, 0.58872594142576418]
    >>> v.repeat(3, 1000)
    [0.27352554584422251, 0.27375686015921019, 0.27433291102624935]

That is about 200X faster, and again the counts agree:

    >>> len(part[part[:-1]==part[1:]])
    39
    >>> sum(part[:-1]==part[1:])
    39
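For anyone who wants to reproduce the comparison as a single script, here is
a minimal sketch along the lines of the snippet above.  The third case,
(mask).sum(), stays entirely on the numpy side; it is my addition, included
only because it is worth measuring alongside the other two, not something
timed above:

    import timeit
    import numpy

    # About a million small ints; sorting makes duplicates cluster into runs.
    original = numpy.random.uniform(0, 100, (1000, 1000)).astype(int)
    data = original.flatten()
    data.sort()
    part = data[::100]                  # sample every 100th element

    setup = 'from __main__ import part'
    cases = [
        ('builtin sum over mask', 'sum(part[:-1] == part[1:])'),
        ('len of masked array  ', 'len(part[part[:-1] == part[1:]])'),
        ('numpy mask.sum()     ', '(part[:-1] == part[1:]).sum()'),
    ]
    for name, stmt in cases:
        best = min(timeit.Timer(stmt, setup).repeat(3, 100)) / 100.0
        print('%s %.6f s/call' % (name, best))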
So my new version of this (compressed) code:

    ...
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = sum(matches)               # count identified matches
    while candidates > N * 10:              # 10 -- heuristic
        stride *= 2                         # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    while candidates < N * 3:               # heuristic slop for long runs
        stride //= 2                        # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    former = None
    past = 0
    for value in sampled[matches]:
        ...
is:

    ...
    sampled = data[::stride]
    candidates = sampled[sampled[:-1] == sampled[1:]]
    while len(candidates) > N * 10:         # 10 -- heuristic
        stride *= 2                         # heuristic increase
        sampled = data[::stride]
        candidates = sampled[sampled[:-1] == sampled[1:]]
    while len(candidates) < N * 3:          # heuristic slop for long runs
        stride //= 2                        # heuristic decrease
        sampled = data[::stride]
        candidates = sampled[sampled[:-1] == sampled[1:]]
    former = None
    past = 0
    for value in candidates:
        ...

This change is important because we try several strides before settling on
one, so the per-trial cost is multiplied and the optimization is
correspondingly valuable.  It also means we could afford to be pickier when
choosing strides (try more values), since each check is now much cheaper.

Summary: when dealing with numpy (or any bulk <-> individual-value
transition), try several formulations you believe are equivalent and
_measure_ them.  In the OODB work I did we called this "impedance
mismatch," and some boundary crossings are likely to be _much_ faster than
others.  The sum case is one of them: the builtin sum gets numpy booleans
back rather than Python bools, so the additions never take the fast path.

--Scott David Daniels
scott.dani...@acm.org
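For anyone who wants to run the new fragment end to end, here is a minimal,
self-contained sketch.  The function name pick_stride, the synthetic data,
the choice of N, and the guard that keeps stride from falling to zero are
mine rather than from the original code, and the elided loop over candidates
is only hinted at in a comment:

    import numpy

    def pick_stride(data, N, stride=100):
        """Adjust the sampling stride until the number of adjacent-equal
        sampled values lands between roughly 3*N and 10*N."""
        sampled = data[::stride]
        candidates = sampled[sampled[:-1] == sampled[1:]]
        while len(candidates) > N * 10:                 # 10 -- heuristic
            stride *= 2                                 # heuristic increase
            sampled = data[::stride]
            candidates = sampled[sampled[:-1] == sampled[1:]]
        while len(candidates) < N * 3 and stride > 1:   # heuristic slop
            stride = max(stride // 2, 1)                # guarded decrease
            sampled = data[::stride]
            candidates = sampled[sampled[:-1] == sampled[1:]]
        return stride, candidates

    if __name__ == '__main__':
        # A million small ints, sorted so that duplicates form long runs.
        data = numpy.sort(numpy.random.uniform(0, 100, 1000000).astype(int))
        stride, candidates = pick_stride(data, N=10)
        print(stride, len(candidates))
        # ... the original code's elided loop over `candidates` would go here.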