GitHub user rezazadeh opened a pull request: https://github.com/apache/incubator-spark/pull/564
Principal Component Analysis

# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.

# Testing

Tests included:

* All principal components
* Only top k principal components

# Documentation

# Example Usage

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.PCA
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry

// Load and parse the data file
// (`sc` is an existing SparkContext, e.g. the one provided by spark-shell)
val data = sc.textFile("mllib/data/als/test.data").map { line =>
  val parts = line.split(',')
  MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4
val k = 1

// recover top principal component
val coeffs = PCA.computePCA(SparseMatrix(data, m, n), k)
{% endhighlight %}

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-spark pca

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/564.patch

----
commit 0642afb2ec1ca6896ffd1a4d3b12eca3f4db52b3
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-02T05:53:33Z

    Initial files

commit 371f40ae288d45986c364adcfe4b584a9b00aa3d
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T01:50:59Z

    new interfaces

commit 173148288dffe6cfa1d6671fa8dd9c57499fd0e8
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T04:04:46Z

    add option to compute U

commit fb022fcc857bc3bbbb793882587480671b3e0b23
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T08:48:24Z

    new tests, SVD interface

commit f756aff7b322504f09236f3ad4e05d4b75e8cc42
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T08:49:47Z

    fix tests

commit 2d831f8f734ddf207707b721aa9718ebd7e65ca9
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T09:04:48Z

    Documentation, yo

commit 31a5ecf977e6e4e6cd4d038aaa9f3d1ad1b3de49
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T09:15:23Z

    added mllib guide docs

commit 57fe6d4ed9e214a504dbb2c5c66205045d5846b5
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T09:18:07Z

    SparkPCA example

commit 07657476d3be2bd177090aaa37f6a4357329a188
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T09:22:15Z

    fix typo

commit b45c1e88cb36ce2e5c78f493b05455f87ecfc662
Author: Reza Zadeh <riz...@gmail.com>
Date:   2014-02-08T09:23:15Z

    fix example

----
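
For reference, the "center the data, then take the SVD" computation summarized in the pull request description can be written out as below. This is the standard textbook formulation and is not taken from the patch itself. The symbols $\mu$, $X_c$, $U$, $\Sigma$, $V$, $v_i$ and $\sigma_i$ are introduced here purely for illustration, and the $1/(m-1)$ variance normalization is an assumption, since the description does not say which convention the code uses.

```latex
\begin{aligned}
\mu &= \tfrac{1}{m}\, X^{\top} \mathbf{1}_m \in \mathbb{R}^{n}
  && \text{(column means of } X\text{)} \\
X_c &= X - \mathbf{1}_m\, \mu^{\top}
  && \text{(centered data, } m \times n\text{)} \\
X_c &= U \Sigma V^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0
  && \text{(singular value decomposition)} \\
\mathrm{coeff} &= \begin{bmatrix} v_1 & v_2 & \cdots & v_k \end{bmatrix} \in \mathbb{R}^{n \times k}
  && \text{(first } k \text{ columns of } V\text{)} \\
\operatorname{Var}(X_c\, v_i) &= \frac{\sigma_i^{2}}{m-1}
  && \text{(variance captured by component } i\text{)}
\end{aligned}
```

Because the singular values are sorted in descending order, the columns of coeff come out in descending order of component variance, which matches the ordering stated in the description.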