[R] Questions about results from PCAproj for robust principal component analysis

Talbot Katz Tue, 13 Feb 2007 12:38:50 -0800

Hi.

I have been looking at the PCAproj function in package pcaPP (R 2.4.1) for 
robust principal components, and I'm trying to interpret the results.  I 
started with a data matrix of dimensions RxC (R is the number of rows / 
observations, C the number of columns / variables).  PCAproj returns a list 
of class princomp, similar to the output of the function princomp.  In a 
case where I can run princomp, I would get the following, from executing  
dmpca = princomp(datamatrix) :
-       the vector, sdev, of length C, contains the standard deviations of the 
components in
                order by descending value; the squares are the eigenvalues of 
the
                covariance matrix
-       the matrix, loadings, has dimension CxC, and the columns are the 
eigenvectors of the
                covariance matrix, in the same order as the sdev vector; the 
columns are
                orthonormal:
                sum(dmpca$loadings[,i]*dmpca$loadings[,j]) = 1 if i == j, ~ 0 
if i != j
-       the vector, center, of length C, contains the means of the variable 
columns in the original
                data matrix, in the same order as the original columns
-       the vector, scale, of length C, contains the scalings applied to each 
variable, in the same
                order as the original columns
-       n.obs contains the number of observations used in the computation; this 
number equals
                R when there is no missing data
-       the matrix, scores, has dimension RxC, and it can be thought of as the 
projection of the
                eigenvector matrix, loadings, back onto the original data; 
these columns 
of
                scores are the principal components.  princomp typically 
removes the mean,
                so the formula is:
                dmpca$scores = t(t(datamatrix) - dmpca$center)%*%dmpca$loadings
                and apply(dmpca$scores,2,mean) returns a length C vector of 
(effectively)
                zeroes; also the principal components (columns of scores) are 
orthogonal
                (but not orthonormal):
                sum(dmpca$scores[,i]*dmpca$scores[,j]) ~ 0 if i != j, > 0 if i 
== j
-       call contains the function call, in this case princomp(x = datamatrix)


That is all as it should be.


In my case R < C, which produces singular results for standard PCA, but 
robust methods, like PCAproj, are designed to handle this.  Also, I had 
"de-meaned" the data beforehand, so apply(datamatrix,2,mean) produces a 
length C vector of (effectively) zeroes.  I ran the following:
dmpcaprj=PCAproj(datamatrix,k=4,CalcMethod="sphere",update=TRUE)
to get the first four robust components.  When I look at the princomp object 
returned as dmpcaprj, some of the results are just what I expect.  For 
example,
-       dmpcaprj$loadings has dimensions Cx4, as expected, and the first four 
eigenvectors of
                the (robust) covariance matrix are orthonormal:
                sum(dmpcaprj$loadings[,i]*dmpcaprj$loadings[,j]) = 1 if i == j, 
~ 0 if i 
!= j
-       dmpcaprj$sdev contains the square roots of the four corresponding 
eigenvalues.
-       dmpcaprj$n.obs equals R.
-       dmpcaprj$scores has dimensions Rx4, as it should.

HOWEVER, the columns of dmpcaprj$scores are neither de-meaned nor 
orthogonal.  So,
                apply(dmpcaprj$scores,2,mean) is a non-zero vector, and
                sum(dmpcaprj$scores[,i]*dmpcaprj$scores[,j]) != 0 if i != j, > 
0 if i == j
ALSO,
-       dmpcaprj$scale is in this case a vector of all 1's, as expected.  But 
the 
length is C, not R.
-       dmpcaprj$center is a vector of length C, not R, and the entries are not 
equal to either
                apply(datamatrix,1,mean)  or  apply(datamatrix,2,mean); I can't 
figure out
                where they came from.

One interesting thing is that the columns of the Rx4 matrix,
                dmpcaprj$scores - datamatrix%*%dmpcaprj$loadings
are all identically constant vectors, such that each row equals 
apply(dmpcaprj$scores,2,mean), since
apply(datamatrix%*%dmpcaprj$loadings,2,mean) is a length four vector of 
(effectively) zeroes,
but I can't interpret the values of these means of dmpcaprj$scores.


Can anyone please explain to me what is happening with the scores, scale, 
and center parts of the PCAproj results?


Thanks!


--  TMK  --
212-460-5430    home
917-656-5351    cell

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Questions about results from PCAproj for robust principal component analysis

Reply via email to