Thanks to everyone commenting on this thread. I haven't quite read it all yet (I will), but I wanted to make a few comments now.

On Thu, Dec 26, 2019 at 10:31:00AM -0500, David Mertz wrote:

> Anyway, I propose that the obviously broken version of
> `statistics.median()` be replaced with a better implementation.

To be precise, the problem is not just the implementation, but the interface, as median is explicitly noted to require orderable data. Data with NANs is not orderable. Richard is correct that this is a case of garbage in, garbage out: if you ignore the documented requirements, you'll get garbage results.

However, I am happy to accept that silent failure may not be the ideal result for everyone. Unfortunately, there is no consensus on what the ideal result is, with at least four valid responses:

- the status quo: the caller is responsible for dealing with NANs, just as they are responsible for dealing with unorderable values passed to min, max, sort, etc. If you know that there are no NANs in your data, any extra processing to check for NANs is just wasted effort;

- NANs represent missing values, so they should be ignored;

- the presence of a NAN is an error, and should raise an exception;

- NANs should propagate through the calculation: a NAN anywhere in your data should produce a NAN result (this is sometimes called "nan poisoning").

Also note that NANs are not just a problem for median. They are a problem for all order statistics, including percentiles, quartiles and general quantiles. Python 3.8 adds a quantiles function which has the same problem:

    py> statistics.quantiles([NAN, 3, 4, 7, 5])
    [nan, 4.0, 6.0]
    py> statistics.quantiles([3, 4, 7, NAN, 5])
    [3.5, nan, nan]

NANs aren't as big a problem for other functions like mean and stdev, but the caller may still want to make the choice of ignore, raise or return a NAN.

So I would like to avoid an ad hoc response to NANs in median alone, and treat them consistently across the entire module.

Marco, you don't have to use median_low and median_high if you don't like them, but they aren't any worse than any other choice for calculating order statistics.
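
Coming back to the NAN options above for a moment, here is a rough sketch of how the ignore / raise / propagate choices might look as a wrapper around median. The helper name `median_with_nan_policy` and its `nan_policy` parameter are made up for illustration; nothing like this exists in the statistics module, and the sketch only detects float NANs (a Decimal NAN, for example, would slip through).

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NANs are detected in this sketch.
    return isinstance(x, float) and math.isnan(x)


def median_with_nan_policy(data, nan_policy="raise"):
    """Median with an explicit NAN policy (hypothetical helper).

    nan_policy is one of:
      "ignore"    - drop NANs, treating them as missing values
      "raise"     - raise ValueError if any NAN is present
      "propagate" - return NAN if any NAN is present
    """
    if nan_policy not in ("ignore", "raise", "propagate"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    clean = [x for x in values if not _is_nan(x)]
    if len(clean) != len(values):
        # At least one NAN was found.
        if nan_policy == "raise":
            raise ValueError("data contains a NAN")
        if nan_policy == "propagate":
            return math.nan
    # Either there were no NANs, or the policy is "ignore".
    return statistics.median(clean)
```

With the data [3, 4, 7, NAN, 5], the "ignore" policy gives 4.5 (the median of 3, 4, 5, 7), "propagate" gives nan, and "raise" raises ValueError. Note that every call pays for a full scan of the data, which is exactly the cost the status quo avoids when the caller already knows there are no NANs.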

All order statistics (apart from min and max) require you to sometimes make a choice between returning a data value or interpolating between two data values, and in general there are *lots* of choices. Here are just a few of them:

"Sample Quantiles in Statistical Packages", Hyndman & Fan, The American Statistician 1996, Vol 50, No 4, pp. 361-365.
https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf

"Quartiles in Elementary Statistics", Langford, Journal of Statistics Education Volume 14, Number 3 (2006).
http://www.amstat.org/publications/jse/v14n3/langford.html

For median, there are only three choices when the midpoint falls between two values: the lower value, the higher value, and the average of the two. All three choices have their pros and cons.

-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/HWKLWDBXOLMTLLLDODSJZ6PTBWYOTEGB/