If you want a proper self-consistent correlation/covariance matrix, then pairwise deletion just makes no sense period, I don't see how postprocessing can help.
If you want a matrix of correlations, then pairwise deletion makes sense. It's an interesting point that arguably the current ma.corrcoef code may actually give you a better estimator of the individual correlation coefficients than doing full pairwise deletion, but it's pretty surprising and unexpected... when people call corrcoef I think they are asking "please compute the textbook formula for 'sample correlation'" not "please give me some arbitrary good estimator for the population correlation", so we probably have to change it. (Hopefully no-one has published anything based on the current code.) -n On 26 Sep 2013 04:19, <josef.p...@gmail.com> wrote: > On Wed, Sep 25, 2013 at 11:05 PM, <josef.p...@gmail.com> wrote: > > On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirz...@gmail.com> > wrote: > >> Hi everyone, > >> > >> I'm using np.ma.corrcoef to compute the correlation coefficients among > rows > >> of a masked matrix, where the masked elements are the missing data. I've > >> observed that in some cases, the np.ma.corrcoef gives invalid > coefficients > >> that are greater than 1 or less than -1. > >> > >> Here's an example: > >> > >> x = array([[ 7, -4, -1, -7, -3, -2], > >> [ 6, -3, 0, 4, 0, 5], > >> [-4, -5, 7, 5, -7, -7], > >> [-5, 5, -8, 0, 1, 4]]) > >> > >> x_ma = np.ma.masked_less_equal(x , -5) > >> > >> C = np.round(np.ma.corrcoef(x_ma), 2) > >> > >> print C > >> > >> [[1.0 0.73 -- -1.68] > >> [0.73 1.0 -0.86 -0.38] > >> [-- -0.86 1.0 --] > >> [-1.68 -0.38 -- 1.0]] > >> > >> As you can see, the [0,3] element is -1.68 which is not a valid > correlation > >> coefficient. (Valid correlation coefficients should be between -1 and > 1). > >> > >> I looked at the code for np.ma.corrcoef, and this behavior seems to be > due > >> to the way that mean values of the rows of the input matrix are > computed and > >> subtracted from them. Apparently, the mean value is individually > computed > >> for each row, without masking the elements corresponding to the masked > >> elements of the other row of the matrix, with respect to which the > >> correlation coefficient is being computed. > >> > >> I guess the right way should be to recompute the mean value for each row > >> every time that a correlation coefficient is being computed for two rows > >> after propagating pair-wise masked values. > >> > >> Please let me know what you think. > > > > just general comments, I have no experience here > > > > From what you are saying it sounds like np.ma is not doing pairwise > > deletion in calculating the mean (which only requires ignoring > > missings in one array), however it does (correctly) do pairwise > > deletion in calculating the cross product. > > Actually, I think the calculation of the mean is not relevant for > having weird correlation coefficients without clipping. > > With pairwise deletion you use different samples, subsets of the data, > for the variances and the covariances. > It should be easy (?) to construct examples where the pairwise > deletion for the covariance produces a large positive or negative > number, and both variances and standard deviations are small, using > two different subsamples. > Once you calculate the correlation coefficient, it could be all over > the place, independent of the mean calculations. > > conclusion: pairwise deletion requires post-processing if you want a > proper correlation matrix. > > Josef > > > > > covariance or correlation matrices with pairwise deletion are not > > necessarily "proper" covariance or correlation matrices. > > I've read that they don't need to be positive semi-definite, but I've > > never heard of values outside of [-1, 1]. It might only be a problem > > if you have a large fraction of missing values.. > > > > I think the current behavior in np.ma makes sense in that it uses all > > the information available in estimating the mean, which should be more > > accurate if we use more information. But it makes cov and corrcoef > > even weirder than they already are with pairwise deletion. > > > > Row-wise deletion (deleting observations that have at least one > > missing), which would create "proper" correlation matrices, wouldn't > > produce much in your example. > > > > I would check what R or other packages are doing and follow their > > lead, or add another option. > > (similar: we had a case in statsmodels where I used initially all > > information for calculating the mean, but then we dropped some > > observations to match the behavior of Stata, and to use the same > > observations for calculating the mean and the follow up statistics.) > > > > > > looks like pandas might be truncating the correlations to [-1, 1] (I > > didn't check) > > > >>>> import pandas as pd > >>>> x_pd = pd.DataFrame(x_ma.T) > >>>> x_pd.corr() > > 0 1 2 3 > > 0 1.000000 0.734367 -1.000000 -0.240192 > > 1 0.734367 1.000000 -0.856565 -0.378777 > > 2 -1.000000 -0.856565 1.000000 NaN > > 3 -0.240192 -0.378777 NaN 1.000000 > > > >>>> np.round(np.ma.corrcoef(x_ma), 6) > > masked_array(data = > > [[1.0 0.734367 -1.190909 -1.681346] > > [0.734367 1.0 -0.856565 -0.378777] > > [-1.190909 -0.856565 1.0 --] > > [-1.681346 -0.378777 -- 1.0]], > > mask = > > [[False False False False] > > [False False False False] > > [False False False True] > > [False False True False]], > > fill_value = 1e+20) > > > > > > Josef > > > > > >> > >> Thanks, > >> > >> Faraz > >> > >> > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion@scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion >
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion