On Thu, Jan 26, 2012 at 3:58 PM, Bruce Southey <bsout...@gmail.com> wrote: > On Thu, Jan 26, 2012 at 12:45 PM, <josef.p...@gmail.com> wrote: >> On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsout...@gmail.com> wrote: >>> On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig >>> <pierre.haes...@crans.org> wrote: >>>> Le 26/01/2012 15:57, Bruce Southey a écrit : >>>>> Can you please provide a >>>>> couple of real examples with expected output that clearly show what >>>>> you want? >>>>> >>>> Hi Bruce, >>>> >>>> Thanks for your ticket feedback ! It's precisely because I see a big >>>> potential impact of the proposed change that I send first a ML message, >>>> second a ticket before jumping to a pull-request like a Sergio Leone's >>>> cowboy (sorry, I watched "for a few dollars more" last weekend...) >>>> >>>> Now, I realize that in the ticket writing I made the wrong trade-off >>>> between conciseness and accuracy which led to some of the errors you >>>> raised. Let me try to use your example to try to share what I have in mind. >>>> >>>>> >> X = array([-2.1, -1. , 4.3]) >>>>> >> Y = array([ 3. , 1.1 , 0.12]) >>>> >>>> Indeed, with today's cov behavior we have a 2x2 array: >>>>> >> cov(X,Y) >>>> array([[ 11.71 , -4.286 ], >>>> [ -4.286 , 2.14413333]]) >>>> >>>> Now, when I used the word 'concatenation', I wasn't precise enough >>>> because I meant assembling X and Y in the sense of 2 vectors of >>>> observations from 2 random variables X and Y. >>>> This is achieved by concatenate(X,Y) *when properly playing with >>>> dimensions* (which I didn't mentioned) : >>>>> >> XY = np.concatenate((X[None, :], Y[None, :])) >>>> array([[-2.1 , -1. , 4.3 ], >>>> [ 3. , 1.1 , 0.12]]) >>> >>> In this context, I find stacking, np.vstack((X,Y)), more appropriate >>> than concatenate. >>> >>>> >>>> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)". >>>>> >> np.cov(XY) >>>> array([[ 11.71 , -4.286 ], >>>> [ -4.286 , 2.14413333]]) >>>> >>> Sure the resulting array is the same but whole process is totally different. >>> >>> >>>> (And indeed, the actual cov Python code does use concatenate() ) >>> Yes, but the user does not see that. Whereas you are forcing the user >>> to do the stacking in the correct dimensions. >>> >>> >>>> >>>> >>>> Now let me come back to my assertion about this behavior *usefulness*. >>>> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 >>>> simple scalars blocks). >>> No there are not '4' blocks just rows and columns. >> >> Sturla showed the 4 blocks in his first message. >> > Well, I could not follow that because the code is wrong. > X = np.array([-2.1, -1. , 4.3]) >>>> cX = X - X.mean(axis=0)[np.newaxis,:] > > Traceback (most recent call last): > File "<pyshell#6>", line 1, in <module> > cX = X - X.mean(axis=0)[np.newaxis,:] > IndexError: 0-d arrays can only use a single () or a list of newaxes > (and a single ...) as an index > whoops! > > Anyhow, variance-covariance matrix is symmetric but numpy or scipy > lacks lapac's symmetrix matrix > (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html) > >>> >>>> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes >>>> to var(X) and var(Y) when setting ddof to 1) >>> Sure but variances are still covariances. >>> >>>> * off diagonal blocks are symetric and are actually the covariance >>>> estimate of X, Y observations (from >>>> http://en.wikipedia.org/wiki/Covariance) >>> Sure >>>> >>>> that is : >>>>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) >>>> -4.2860000000000005 >>>> >>>> The new proposed behaviour for cov is that cov(X,Y) would return : >>>> array(-4.2860000000000005) instead of the 2*2 matrix. >>> >>> But how you interpret an 2D array where the rows are greater than 2? >>>>>> Z=Y+X >>>>>> np.cov(np.vstack((X,Y,Z))) >>> array([[ 11.71 , -4.286 , 7.424 ], >>> [ -4.286 , 2.14413333, -2.14186667], >>> [ 7.424 , -2.14186667, 5.28213333]]) >>> >>> >>>> >>>> * This would be in line with the cov(X,Y) mathematical definition, as >>>> well as with R behavior. >>> I don't care what R does because I am using Python and Python is >>> infinitely better than R is! >>> >>> But I think that is only in the 1D case. >> >> I just checked R to make sure I remember correctly >> >>> xx = matrix((1:20)^2, nrow=4) >>> xx >> [,1] [,2] [,3] [,4] [,5] >> [1,] 1 25 81 169 289 >> [2,] 4 36 100 196 324 >> [3,] 9 49 121 225 361 >> [4,] 16 64 144 256 400 >>> cov(xx, 2*xx[,1:2]) >> [,1] [,2] >> [1,] 86.0000 219.3333 >> [2,] 219.3333 566.0000 >> [3,] 352.6667 912.6667 >> [4,] 486.0000 1259.3333 >> [5,] 619.3333 1606.0000 >>> cov(xx) >> [,1] [,2] [,3] [,4] [,5] >> [1,] 43.0000 109.6667 176.3333 243.0000 309.6667 >> [2,] 109.6667 283.0000 456.3333 629.6667 803.0000 >> [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333 >> [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667 >> [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000 >> >> >>> >>>> * This would save memory and computing resources. (and therefore help >>>> save the planet ;-) ) >>> Nothing that you have provided shows that it will. >> >> I don't know about saving the planet, but if X and Y have the same >> number of columns, we save 3 quarters of the calculations, as Sturla >> also explained in his first message. >> > Can not figure those savings: > For a 2 by 2 output has 3 covariances (so 3/4 =0.75 is 'needed' not 25%) > a 3 by 3 output has 6 covariances > a 5 by 5 output 15 covariances
what numpy calculates are 4, 9 and 25 covariances, we might care only about 1, 2 and 4 of them. > > If you want to save memory and calculation then use symmetric storage > and associated methods. actually for covariance matrix we stilll need to subtract means, so we won't save 75%, but we save 75% in the cross-product. suppose X and Y are (nobs, k_x) and (nobs, k_y) (means already subtracted) (and ignoring that numpy "likes" rows instead of columns) the partitioned dot product [X,Y]'[X,Y] is [[ X'X, X'Y], [Y'X, Y'Y]] X'Y is (n_x, n_y) total shape is (n_x + n_y, n_x + n_y) If we are only interested in X'Y, we don't need the other three submatrices. If n_x = 99 and n_y is 1, we save .... ? (we get a (99,1) instead of a (100, 100) matrix) and X'Y , np.dot(X, Y), doesn't have any duplicated symmetry, so exploiting symmetry is a different issue. Josef > > Bruce > >> Josef >> >>> >>>> >>>> However, I do understand that the impact for this change may be big. >>>> This indeed requires careful reviewing. >>>> >>>> Pierre >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion@scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> Bruce >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion@scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion