Luc <[email protected]> added the comment:
If we are going to fix this, median should behave the way mean and
harmonic_mean in the statistics library already do when the data contain
missing values: return nan. That would at least make it consistent with how
the rest of the statistics library handles NaNs. Either way, the behavior
should be mentioned somewhere in the docs.

>>> import statistics as stats
>>> import numpy as np
>>> import pandas as pd
>>> data = [75, 90, 85, 92, 95, 80, np.nan]
>>> stats.mean(data)
nan
>>> stats.harmonic_mean(data)
nan
>>> stats.stdev(data)
nan

As you can see, when there is a missing value, the mean, harmonic mean and
sample standard deviation computed with the statistics library all return nan.
However, median, median_high and median_low do not: they silently compute an
incorrect result when missing values are present in the data.
It would be better to return nan and let the user drop (or otherwise resolve)
any missing values before computing.
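
To make the proposal concrete, here is a minimal sketch of the "return nan" behavior as a wrapper around statistics.median. The name strict_median is hypothetical and not part of the statistics library; it only illustrates the idea and is not a proposed patch.

```python
import math
import statistics


def strict_median(values):
    """Hypothetical helper: propagate NaN instead of sorting it.

    Returns nan if any value is a float NaN, otherwise the usual median.
    """
    data = list(values)
    # NaN is the only float for which math.isnan() is true; non-float
    # values (ints, Fractions, ...) are passed through untouched.
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        return math.nan
    return statistics.median(data)


print(strict_median([75, 90, 85, 92, 95, 80, float("nan")]))  # nan
print(strict_median([75, 90, 85, 92, 95, 80]))                # 87.5
```

With this, the caller sees nan and has to decide explicitly how to handle the missing values, just as with mean and harmonic_mean.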

## Another example, using a pandas Series
>>> df = pd.DataFrame(data, columns=['data'])
>>> df.head(7)
   data
0  75.0
1  90.0
2  85.0
3  92.0
4  95.0
5  80.0
6   NaN

## Use the statistics library to compute the median of the Series
>>> stats.median(df['data'])
90

## Now use pandas to compute the median of the Series with the missing value.
## Pandas returns the correct median by dropping the missing values.
>>> df['data'].median()
87.5

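
For comparison, the pandas result can be reproduced with the statistics library by filtering out the NaNs by hand first. This is only a sketch of the "drop missing values" workaround; median_skipping_nan is a hypothetical name, not an existing function.

```python
import math
import statistics


def median_skipping_nan(values):
    """Hypothetical helper: drop float NaNs, then take the median."""
    clean = [x for x in values
             if not (isinstance(x, float) and math.isnan(x))]
    # statistics.median raises StatisticsError if nothing is left.
    return statistics.median(clean)


data = [75, 90, 85, 92, 95, 80, float("nan")]
print(median_skipping_nan(data))  # 87.5
```

The remaining six values sort to 75, 80, 85, 90, 92, 95, so the median is the average of 85 and 90.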
I have not tested median_grouped in the statistics library yet, but I will
let you know afterwards if it is affected as well.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue33084>
_______________________________________