On Sun, Dec 29, 2019 at 4:05 PM Christopher Barker <python...@gmail.com> wrote:
> You mean performance? Sure, but as I've argued before (no idea if anyone
> agrees with me) the statistics package is already not a high performance
> package anyway. If it turns out that it slows it down by, say, a factor of
> two or more, then yes, maybe we need to forget it.

You never know 'till you profile, so I did a quick experiment -- adding a NaN filter is substantial overhead.

This is for a list of 10,000 random floats (no NaNs in there, but the check is made by pre-filtering with a generator expression):

    # this just calls statistics.median directly
    In [14]: %timeit plainmedian(lots_of_floats)
    1.54 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # this filters with math.isnan()
    In [15]: %timeit nanmedianfloat(lots_of_floats)
    3.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # this filters with a complex NaN-checker that works with most types
    # and values: floats, Decimals, numpy scalars, ...
    In [16]: %timeit nanmedian(lots_of_floats)
    13.5 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the simple math.isnan filter slows it down by a factor of a bit more than two (3.5 ms vs. 1.54 ms) -- maybe tolerable. The full-featured NaN checker slows it down by almost a factor of nine (13.5 ms vs. 1.54 ms) -- that's pretty bad.

I suspect that if the check were inlined more, it could be a bit faster, and I'm sure the NaN-checking code could be better optimized, but this is a pretty big hit.

Note that numpy has a number of "nan*" functions: NaN-aware versions that treat NaN as missing values (including nanquantile). We could take a similar route, and have new names or a flag to disable or enable NaN-checking.

Code enclosed.

-CHB

--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
# some tests of the impact of NaN-checking on statistics functions

import math
import cmath
import statistics
import random

# A few big lists for testing:
lots_of_floats = [random.random() for __ in range(10000)]
# same length as above, with 100 of the values replaced by NaN
lots_with_nans = lots_of_floats[100:] + [float("NaN")] * 100
random.shuffle(lots_with_nans)


def is_nan(num):
    """
    This version works for everything I've tried
    """
    try:
        # Decimal (and similar types) provide an is_nan() method
        return num.is_nan()
    except AttributeError:
        if isinstance(num, complex):
            return cmath.isnan(num)
        try:
            return math.isnan(num)
        except Exception:
            # anything math.isnan() can't handle is treated as not-a-NaN
            return False


def nanmedian(numbers):
    """
    a version of median that filters out NaN values
    """
    return statistics.median((num for num in numbers if not is_nan(num)))


def nanmedianfloat(numbers):
    """
    a version of median that filters out NaN values -- but only
    for values that math.isnan works on
    """
    return statistics.median((num for num in numbers if not math.isnan(num)))


def plainmedian(numbers):
    """
    just a wrapper to equalize the function call overhead
    """
    return statistics.median(numbers)


# a couple sanity checks:
ints = [1, 2, 3, 4, 5, 6]
assert nanmedian(ints) == statistics.median(ints)

floats = [1.0, 2.2, 3.3, 4.4, 5.5, 6.3]
assert nanmedian(floats) == statistics.median(floats)

floats_with_nan = floats[:] + [float("NaN")] * 3
random.shuffle(floats_with_nan)
assert nanmedian(floats_with_nan) == statistics.median(floats)
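
For the "flag to disable or enable NaN-checking" idea mentioned in the message, a minimal sketch might look like the following. The wrapper and its `ignore_nan` keyword are hypothetical names chosen for illustration; they are not part of the `statistics` module. The sketch also assumes every value is something `math.isnan()` accepts, so it corresponds to the cheaper of the two filters timed above.

```python
import math
import statistics


def median(data, *, ignore_nan=False):
    """Hypothetical wrapper around statistics.median with opt-in NaN filtering.

    The ignore_nan keyword is an assumption for illustration only; it does
    not exist in the statistics module.
    """
    if ignore_nan:
        # Assumes every value is accepted by math.isnan() (ints, floats, ...).
        data = [x for x in data if not math.isnan(x)]
    # With ignore_nan=False this is a plain pass-through, so callers who
    # know their data is NaN-free pay no filtering cost.
    return statistics.median(data)


values = [1.0, 2.0, float("nan"), 3.0, 4.0]
print(median(values, ignore_nan=True))  # median of [1.0, 2.0, 3.0, 4.0] -> 2.5
```

Keeping the check off by default would preserve the current speed for existing callers, at the cost of leaving the NaN behavior as it is today unless the flag is set.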