On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <[email protected]> wrote:
> If you want a proper self-consistent correlation/covariance matrix, then
> pairwise deletion just makes no sense period, I don't see how postprocessing
> can help.

clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
For the latter there is some code in statsmodels, and several newer
algorithms that I haven't looked at.

It's a quite common problem in finance, but usually longer time series
with not a large number of missing values.

>
> If you want a matrix of correlations, then pairwise deletion makes sense.
> It's an interesting point that arguably the current ma.corrcoef code may
> actually give you a better estimator of the individual correlation
> coefficients than doing full pairwise deletion, but it's pretty surprising
> and unexpected... when people call corrcoef I think they are asking "please
> compute the textbook formula for 'sample correlation'" not "please give me
> some arbitrary good estimator for the population correlation", so we
> probably have to change it.
>
> (Hopefully no-one has published anything based on the current code.)

I haven't seen a textbook version of this yet.

Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
especially if it doesn't really solve the problem.

Josef

>
> -n
>
> On 26 Sep 2013 04:19, <[email protected]> wrote:
>>
>> On Wed, Sep 25, 2013 at 11:05 PM, <[email protected]> wrote:
>> > On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <[email protected]>
>> > wrote:
>> >> Hi everyone,
>> >>
>> >> I'm using np.ma.corrcoef to compute the correlation coefficients among
>> >> rows of a masked matrix, where the masked elements are the missing data.
>> >> I've observed that in some cases, the np.ma.corrcoef gives invalid
>> >> coefficients that are greater than 1 or less than -1.
>> >>
>> >> Here's an example:
>> >>
>> >> import numpy as np
>> >>
>> >> x = np.array([[ 7, -4, -1, -7, -3, -2],
>> >>               [ 6, -3,  0,  4,  0,  5],
>> >>               [-4, -5,  7,  5, -7, -7],
>> >>               [-5,  5, -8,  0,  1,  4]])
>> >>
>> >> x_ma = np.ma.masked_less_equal(x, -5)
>> >>
>> >> C = np.round(np.ma.corrcoef(x_ma), 2)
>> >>
>> >> print(C)
>> >>
>> >> [[1.0 0.73 -- -1.68]
>> >>  [0.73 1.0 -0.86 -0.38]
>> >>  [-- -0.86 1.0 --]
>> >>  [-1.68 -0.38 -- 1.0]]
>> >>
>> >> As you can see, the [0,3] element is -1.68 which is not a valid
>> >> correlation coefficient. (Valid correlation coefficients should be
>> >> between -1 and 1).
>> >>
>> >> I looked at the code for np.ma.corrcoef, and this behavior seems to be
>> >> due to the way that mean values of the rows of the input matrix are
>> >> computed and subtracted from them. Apparently, the mean value is
>> >> individually computed for each row, without masking the elements
>> >> corresponding to the masked elements of the other row of the matrix,
>> >> with respect to which the correlation coefficient is being computed.
>> >>
>> >> I guess the right way should be to recompute the mean value for each
>> >> row every time that a correlation coefficient is being computed for two
>> >> rows after propagating pair-wise masked values.
>> >>
>> >> Please let me know what you think.
>> >
>> > just general comments, I have no experience here
>> >
>> > From what you are saying it sounds like np.ma is not doing pairwise
>> > deletion in calculating the mean (which only requires ignoring
>> > missings in one array), however it does (correctly) do pairwise
>> > deletion in calculating the cross product.
>>
>> Actually, I think the calculation of the mean is not relevant for
>> having weird correlation coefficients without clipping.
>>
>> With pairwise deletion you use different samples, subsets of the data,
>> for the variances and the covariances.
>> It should be easy (?)
>> to construct examples where the pairwise
>> deletion for the covariance produces a large positive or negative
>> number, and both variances and standard deviations are small, using
>> two different subsamples.
>> Once you calculate the correlation coefficient, it could be all over
>> the place, independent of the mean calculations.
>>
>> conclusion: pairwise deletion requires post-processing if you want a
>> proper correlation matrix.
>>
>> Josef
>>
>> >
>> > covariance or correlation matrices with pairwise deletion are not
>> > necessarily "proper" covariance or correlation matrices.
>> > I've read that they don't need to be positive semi-definite, but I've
>> > never heard of values outside of [-1, 1]. It might only be a problem
>> > if you have a large fraction of missing values..
>> >
>> > I think the current behavior in np.ma makes sense in that it uses all
>> > the information available in estimating the mean, which should be more
>> > accurate if we use more information. But it makes cov and corrcoef
>> > even weirder than they already are with pairwise deletion.
>> >
>> > Row-wise deletion (deleting observations that have at least one
>> > missing), which would create "proper" correlation matrices, wouldn't
>> > produce much in your example.
>> >
>> > I would check what R or other packages are doing and follow their
>> > lead, or add another option.
>> > (similar: we had a case in statsmodels where I used initially all
>> > information for calculating the mean, but then we dropped some
>> > observations to match the behavior of Stata, and to use the same
>> > observations for calculating the mean and the follow up statistics.)
>> >
>> >
>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>> > didn't check)
>> >
>> > >>> import pandas as pd
>> > >>> x_pd = pd.DataFrame(x_ma.T)
>> > >>> x_pd.corr()
>> >           0         1         2         3
>> > 0  1.000000  0.734367 -1.000000 -0.240192
>> > 1  0.734367  1.000000 -0.856565 -0.378777
>> > 2 -1.000000 -0.856565  1.000000       NaN
>> > 3 -0.240192 -0.378777       NaN  1.000000
>> >
>> > >>> np.round(np.ma.corrcoef(x_ma), 6)
>> > masked_array(data =
>> >  [[1.0 0.734367 -1.190909 -1.681346]
>> >  [0.734367 1.0 -0.856565 -0.378777]
>> >  [-1.190909 -0.856565 1.0 --]
>> >  [-1.681346 -0.378777 -- 1.0]],
>> >              mask =
>> >  [[False False False False]
>> >  [False False False False]
>> >  [False False False True]
>> >  [False False True False]],
>> >        fill_value = 1e+20)
>> >
>> >
>> > Josef
>> >
>> >
>> >>
>> >> Thanks,
>> >>
>> >> Faraz
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> [email protected]
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
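
For reference, here is a minimal sketch of the approach Faraz describes above: for each pair of rows, keep only the columns where both rows are unmasked, then recompute the means, variances and cross product on that shared subset. It assumes plain NumPy, and `pairwise_corrcoef` is a hypothetical helper name, not an existing np.ma function. On the thread's data it should agree with the pandas values quoted above (for example -0.240192 for the [0, 3] entry), and pairs with fewer than two shared observations come out as NaN.

```python
import numpy as np


def pairwise_corrcoef(x_ma):
    """Row-by-row correlation using only the columns both rows have unmasked.

    Hypothetical helper for illustration; not part of numpy.ma.
    """
    data = np.ma.getdata(x_ma).astype(float)
    valid = ~np.ma.getmaskarray(x_ma)
    n = data.shape[0]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]      # columns observed in both rows
            if both.sum() < 2:
                continue                    # correlation undefined, leave NaN
            # Means are recomputed on the shared subset for every pair.
            a = data[i, both] - data[i, both].mean()
            b = data[j, both] - data[j, both].mean()
            denom = np.sqrt((a @ a) * (b @ b))
            if denom > 0:
                out[i, j] = out[j, i] = (a @ b) / denom
    return out


# Data from the original report in this thread.
x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

C_pair = pairwise_corrcoef(x_ma)
print(np.round(C_pair, 6))
```

Because each entry uses the same subsample for the cross product and both variances, every entry stays within [-1, 1]. As discussed above, the matrix as a whole is still not guaranteed to be positive semi-definite, and the means are recomputed (n + 1) * n / 2 times.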

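The post-processing mentioned in the reply (getting from a pairwise matrix to a positive semi-definite one) can be sketched with simple eigenvalue clipping. This is only one basic option under stated assumptions: it is not the statsmodels code referred to above and it does not claim to find the nearest correlation matrix in any norm. `clip_to_psd_corr` is a hypothetical name, the 3x3 input is a made-up illustration rather than data from this thread, and the input must not contain NaN or masked entries.

```python
import numpy as np


def clip_to_psd_corr(corr):
    """Clip negative eigenvalues to zero, then rescale to a unit diagonal.

    Hypothetical helper for illustration; a simple repair, not a
    nearest-correlation-matrix algorithm.
    """
    corr = np.asarray(corr, dtype=float)
    sym = (corr + corr.T) / 2.0             # enforce exact symmetry
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)         # drop the negative eigenvalues
    fixed = (vecs * vals) @ vecs.T          # rebuild V diag(vals) V^T
    d = np.sqrt(np.diag(fixed))
    fixed = fixed / np.outer(d, d)          # rescale so the diagonal is 1
    np.fill_diagonal(fixed, 1.0)            # remove rounding noise on the diagonal
    return fixed


# Made-up symmetric matrix with unit diagonal and entries in [-1, 1]
# that is nevertheless not positive semi-definite.
bad = np.array([[ 1.0, 0.9, -0.9],
                [ 0.9, 1.0,  0.9],
                [-0.9, 0.9,  1.0]])

print(np.linalg.eigvalsh(bad))              # one eigenvalue is negative
fixed = clip_to_psd_corr(bad)
print(np.round(fixed, 6))
print(np.linalg.eigvalsh(fixed))            # all >= 0 up to rounding
```

The rescaling step keeps the result positive semi-definite, since it multiplies by the same diagonal matrix on both sides. The example also shows why clipping entries to [-1, 1] alone is not enough: `bad` already has all entries in that range.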