Hi all, As a molecular biologist by training, I'm fairly new to R (and statistics!), and was hoping for some advice. First of all, I'd like to apologise if my question is more methodological rather than relating to a specific R function. I've done my best to search both in the forum and elsewhere but can't seem to find an answer which works in practice.
I am carrying out principal component analysis in R using the Eigen function, and have applied this to my dataset and can identify patterns in my data which result from batch effects. What I'd ultimately like to do is use PCA as a noise filter to remove the batch effects from my data, and the way I would like to try and do this is by reconstructing the data, minus the components (or that part of them) which correlate with batch. As I have an extremely large dataset, as a first step I am attempting simply to apply the Eigen function to a dummy dataset consisting of only 10 individuals and ten variables derived from my originial data, then reconstruct the original dummy data matrix, just to see if I can get the basics of the method to work first. Here is the dummy matrix (adjusted by variable mean-subtraction and division by the variable standard deviation applied to each data point) I am using: adjusted_dummyset [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] Sample1 0.7329881 0.76912670 2.45906143 -0.06411602 1.2427801 0.3785717 2.34508664 1.1043552 -0.1883830 0.6503095 Sample2 -2.0446131 1.72783245 -0.40965941 -0.21713655 -0.2386781 -0.1944390 -0.25181541 0.8882743 0.8404783 0.2531209 Sample3 1.7575174 0.03851425 -0.87537424 1.82811160 0.8342636 -0.8155942 -0.90893068 -0.5529098 -1.6586626 0.1761717 Sample4 0.2731749 0.24045167 -0.39913821 -0.48525909 0.1448994 -2.0173360 -0.01073639 -0.3219478 -1.1536431 -0.2521545 Sample5 -0.2014241 -1.04646151 0.28101160 0.74348390 0.1738312 -0.8431262 -0.08842512 1.2909658 1.2013136 0.6706926 Sample6 0.1743534 -1.70657357 -0.09170187 -0.55605031 -0.2940946 1.4525891 -0.39068509 -0.3373913 -0.0533732 0.9658389 Sample7 -0.8533191 -0.34438091 -1.23890437 -0.77360636 0.5926479 0.7742632 -1.12515017 -0.5720099 0.2243808 0.5420693 Sample8 0.4176988 -0.35906123 -0.07190644 0.90045123 -1.0621902 0.2693762 -0.38033715 0.6267548 0.4767652 0.3012347 Sample9 0.1088066 -0.32197951 0.46665158 -1.72560781 0.7375796 0.2794331 1.00171777 -0.1087306 1.2519195 -0.8848459 Sample10 -0.3651829 1.00253167 -0.12004007 0.34972942 -2.1310389 0.7162622 -0.19072440 -2.0173607 -0.9407956 -2.4224372 I first multiply the matrix by the transposed matrix: > xxt=adjusted_dummyset %*% t(adjusted_dummyset) > xxt Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample1 16.045163 -1.1367278 -2.5390033 -1.4762231 1.0158736 -1.8408484 -5.8176751 -1.5163240 3.5304834 -6.2647180 Sample2 -1.136728 9.0985066 -5.2175085 -0.8332460 0.7980575 -3.3607995 1.6342076 -0.3098849 -0.3462389 -0.3263661 Sample3 -2.539003 -5.2175085 12.4738749 3.7747189 -0.9562913 -1.3252885 -0.9175020 0.5849437 -6.0796087 0.2016647 Sample4 -1.476223 -0.8332460 3.7747189 6.1161097 -1.0232149 -3.0984138 -1.1213965 -1.9014901 -1.0503243 0.6134802 Sample5 1.015874 0.7980575 -0.9562913 -1.0232149 6.0758629 0.2183710 -0.9466710 2.1466445 -0.2626433 -7.0659890 Sample6 -1.840848 -3.3607995 -1.3252885 -3.0984138 0.2183710 6.6590617 3.0772441 1.0977948 0.3980588 -1.8251802 Sample7 -5.817675 1.6342076 -0.9175020 -1.1213965 -0.9466710 3.0772441 5.8681617 -0.9215294 0.1646922 -1.0195316 Sample8 -1.516324 -0.3098849 0.5849437 -1.9014901 2.1466445 1.0977948 -0.9215294 3.1757171 -2.2533118 -0.1025599 Sample9 3.530483 -0.3462389 -6.0796087 -1.0503243 -0.2626433 0.3980588 0.1646922 -2.2533118 7.2986178 -1.3997251 Sample10 -6.264718 -0.3263661 0.2016647 0.6134802 -7.0659890 -1.8251802 -1.0195316 -0.1025599 -1.3997251 17.1889249 Then I apply the Eigen function and save the eigenvectors and eigenvalues eigen(xxt,TRUE) $values [1] 2.693535e+01 1.898845e+01 1.668814e+01 1.184849e+01 8.136377e+00 4.893302e+00 2.007293e+00 4.994285e-01 3.173378e-03 -1.652399e-15 $vectors [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0.60510924 0.13494101 -0.54092789 -0.11000076 0.11721749 0.37004560 -0.106993925 -0.10503855 0.19436075 -0.3162278 [2,] 0.05581480 -0.39453663 0.09160465 0.67089969 0.08088382 0.32755098 -0.009124933 0.11721316 -0.39379401 -0.3162278 [3,] -0.28666669 0.71121151 -0.01000162 0.03014755 -0.06261445 0.22739798 0.422156670 0.02330019 -0.27677054 -0.3162278 [4,] -0.13469379 0.23709870 -0.15611654 0.27760975 -0.50201617 -0.31821931 -0.584748137 0.12405010 0.11661762 -0.3162278 [5,] 0.26143053 0.12386029 0.29769330 0.19250271 0.27617639 -0.49754797 0.084826842 -0.59908097 0.02670441 -0.3162278 [6,] 0.01918933 -0.04796651 0.35560411 -0.57607725 0.07402744 0.14338749 -0.455866694 -0.01271614 -0.45276432 -0.3162278 [7,] -0.11887886 -0.18220751 0.42748177 -0.07360955 -0.34327294 0.38589551 0.142875991 -0.16600492 0.59142740 -0.3162278 [8,] -0.02725610 0.06425762 0.15301980 -0.01332073 0.53079711 -0.18890512 0.012099700 0.65880817 0.34630946 -0.3162278 [9,] 0.24706215 -0.31886828 -0.13294196 -0.25401667 -0.42308063 -0.38135353 0.482282448 0.24225089 -0.19843310 -0.3162278 [10,] -0.62111060 -0.32779019 -0.48541562 -0.14413474 0.25188193 -0.06825162 0.012492037 -0.28278192 0.04634233 -0.3162278 I also calculate: xtx=t(adjusted_dummyset) %*% adjusted_dummyset > xtx [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 9.0000000 -3.1796310 2.0417037 3.9514142 2.7275520 -1.66571377 1.5629736 -0.9104748 -4.8506341 0.68480373 [2,] -3.1796310 9.0000000 1.0981249 0.5494749 -1.2662064 -1.89318349 2.1005566 -0.5052055 -1.7947356 -3.90496189 [3,] 2.0417037 1.0981249 9.0000000 -1.1689542 2.3836861 1.22540724 8.5924385 4.2130304 1.8322104 0.72643120 [4,] 3.9514142 0.5494749 -1.1689542 9.0000000 -1.7132579 -2.51679155 -2.8679239 0.5181511 -3.9536139 0.84095707 [5,] 2.7275520 -1.2662064 2.3836861 -1.7132579 9.0000000 -2.17714376 3.1966738 4.1903147 0.7937017 5.20170437 [6,] -1.6657138 -1.8931835 1.2254072 -2.5167916 -2.1771438 9.00000000 0.3764605 -1.9821441 2.3330854 -0.08204933 [7,] 1.5629736 2.1005566 8.5924385 -2.8679239 3.1966738 0.37646053 9.0000000 3.5708620 1.7809104 -0.28160124 [8,] -0.9104748 -0.5052055 4.2130304 0.5181511 4.1903147 -1.98214408 3.5708620 9.0000000 5.3281685 6.32863320 [9,] -4.8506341 -1.7947356 1.8322104 -3.9536139 0.7937017 2.33308543 1.7809104 5.3281685 9.0000000 2.27959505 [10,] 0.6848037 -3.9049619 0.7264312 0.8409571 5.2017044 -0.08204933 -0.2816012 6.3286332 2.2795951 9.00000000 > and apply the eigen function to this: > eigen(xtx,TRUE) $values [1] 2.693535e+01 1.898845e+01 1.668814e+01 1.184849e+01 8.136377e+00 4.893302e+00 2.007293e+00 4.994285e-01 3.173378e-03 6.009067e-16 $vectors [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] -0.01604003 0.56323526 -0.190946669 -0.40105857 -0.002543387 0.17417910 0.25192284 -0.59870554 0.18090811 -0.01463338 [2,] 0.08466679 -0.16695878 -0.455330886 0.54953906 -0.002740213 -0.47379253 0.19091718 -0.36818489 0.23997046 0.03943000 [3,] -0.42016626 -0.01672674 -0.438168969 -0.16986221 -0.259624942 0.03774746 -0.20059192 0.33694834 0.50122476 -0.35847979 [4,] 0.17380688 0.46248616 -0.009158497 0.19699882 -0.641525880 -0.08643216 0.32002024 0.38196270 -0.08822512 0.20466364 [5,] -0.38230998 0.27820625 0.061021059 0.01793892 0.556873562 -0.31199982 0.48447946 0.33729239 -0.05542533 -0.11569124 [6,] -0.01078626 -0.35623684 0.086516360 -0.57830505 -0.317806785 -0.56648878 0.22620425 -0.09749859 -0.18851088 -0.11373558 [7,] -0.41357826 -0.06924232 -0.495909637 -0.11281244 0.008420391 0.07292844 -0.09886884 0.01075021 -0.44335225 0.59469674 [8,] -0.48705511 0.07929178 0.158922061 0.33799785 -0.279059362 0.07881225 -0.04227138 -0.30875079 -0.45445853 -0.47881039 [9,] -0.33349167 -0.40387477 0.287225004 0.07874711 -0.163416914 0.37364320 0.52438989 -0.09014761 0.33696991 0.27201980 [10,] -0.34649980 0.24943468 0.446371805 0.03960183 -0.072567359 -0.40852699 -0.43011632 -0.14042978 0.30891048 0.38025985 I create a diagonal matrix containing the square roots of the eigenvalues along the diagonal: > diagonalmatrix_eigenvaluesqrts V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 1 5.61 0.00 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.0000 2 0.00 1.41 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.0000 3 0.00 0.00 1.12 0.000 0.000 0.000 0.000 0.000 0.000 0.0000 4 0.00 0.00 0.00 0.941 0.000 0.000 0.000 0.000 0.000 0.0000 5 0.00 0.00 0.00 0.000 0.727 0.000 0.000 0.000 0.000 0.0000 6 0.00 0.00 0.00 0.000 0.000 0.471 0.000 0.000 0.000 0.0000 7 0.00 0.00 0.00 0.000 0.000 0.000 0.296 0.000 0.000 0.0000 8 0.00 0.00 0.00 0.000 0.000 0.000 0.000 0.246 0.000 0.0000 9 0.00 0.00 0.00 0.000 0.000 0.000 0.000 0.000 0.159 0.0000 10 0.00 0.00 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.0127 I have been operating under the assumption that the original data matrix could be reconstructed by multiplying the eigenvectors from xxt X a diagonal matrix with the square roots of the eigenvalues along the diagonal X the eigenvectors from xtx as follows: Reconstructed_dummyset=eigenvectors_xxt %*% as.matrix(diagonalmatrix_eigenvaluesqrts) %*% t(eigenvectors_xtx) > Reconstructed_dummyset [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0.253193770 0.402535986 -1.14743281 0.5698820 -1.31593686 -0.23278020 -1.10482209 -1.78241789 -1.344859283 -1.4554367 [2,] -0.593623956 0.320035080 -0.30357878 -0.1156343 -0.25992539 -0.25866729 -0.17538542 0.05041073 0.221261845 -0.2675080 [3,] 0.620290857 -0.322491313 0.63113668 0.2551444 0.90025477 -0.36701269 0.61029139 0.90661927 0.226875492 0.6933749 [4,] 0.045394154 0.135401863 0.48897212 0.2772398 0.14530592 -0.12141324 0.33943264 0.53647586 0.002607645 0.4395889 [5,] -0.007416873 0.213143606 -0.90810973 0.2103079 -0.34091303 -0.06529543 -0.82817426 -0.61633082 -0.543192842 -0.2177753 [6,] 0.068080740 -0.533029067 -0.14658587 -0.2022178 -0.09846926 0.29932974 -0.12798021 -0.14611192 -0.016267532 0.1053571 [7,] -0.113178997 -0.309552826 0.18447270 -0.1195559 0.01775153 0.15282100 0.02644529 0.41110216 0.623380590 0.3368958 [8,] -0.076800910 -0.117830736 -0.03085105 -0.1712106 0.39036065 -0.10570034 -0.05526268 -0.08333122 -0.028936234 0.1511139 [9,] -0.187036903 0.212383109 -0.41884937 0.2704928 -0.69211148 0.50295414 -0.45885438 -0.75166730 -0.301317053 -0.6536756 [10,] -0.008314191 -0.002179248 1.66522298 -0.9826679 1.25832941 0.20033204 1.75042615 1.49448058 1.149522805 0.8527934 However as you can see I've clearly got my wires crossed somewhere! I'd really appreciate any advice you could offer... Thanks! Julia -- View this message in context: http://r.789695.n4.nabble.com/Data-reconstruction-following-PCA-using-Eigen-function-tp2226535p2226535.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.