GitHub user rezazadeh opened a pull request:
https://github.com/apache/incubator-spark/pull/564
Principal Component Analysis
# Principal Component Analysis
Computes the top k principal component coefficients for the m-by-n data
matrix X. Rows of X correspond to observations and columns correspond to
variables. The coefficient matrix is n-by-k: each of its columns contains
the coefficients for one principal component, and the columns are in
descending order of component variance. The function centers the data and
then applies the singular value decomposition (SVD).
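To make the centering step and the meaning of a coefficient column concrete, here is a minimal local sketch in plain Scala. It is not the code in this pull request. It assumes a small dense matrix that fits in memory, recovers only the top component (k = 1), and swaps the SVD for power iteration on the sample covariance matrix. The dominant eigenvector of that matrix is the first right singular vector of the centered data, up to sign. The names `PcaSketch`, `center`, `covariance`, and `topComponent` are hypothetical and chosen only for illustration.

```scala
object PcaSketch {
  type Matrix = Array[Array[Double]]

  /** Subtract each column's mean so that every variable has mean zero. */
  def center(x: Matrix): Matrix = {
    val m = x.length
    val n = x(0).length
    val means = Array.tabulate(n)(j => x.map(_(j)).sum / m)
    x.map(row => Array.tabulate(n)(j => row(j) - means(j)))
  }

  /** n-by-n sample covariance of already-centered data: Xc^T Xc / (m - 1). */
  def covariance(xc: Matrix): Matrix = {
    val m = xc.length
    val n = xc(0).length
    Array.tabulate(n, n)((i, j) => xc.map(r => r(i) * r(j)).sum / (m - 1))
  }

  /** Top principal component via power iteration (an assumption of this
    * sketch; the pull request itself uses the SVD). The result has unit
    * length and its sign is arbitrary. */
  def topComponent(x: Matrix, iterations: Int = 200): Array[Double] = {
    val c = covariance(center(x))
    val n = c.length
    // Fixed seed keeps the sketch deterministic.
    val rng = new scala.util.Random(42)
    var v = Array.fill(n)(rng.nextGaussian())
    for (_ <- 0 until iterations) {
      // w = C * v, then renormalize to unit length.
      val w = c.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(w.map(a => a * a).sum)
      if (norm > 0) v = w.map(_ / norm)
    }
    v
  }

  def main(args: Array[String]): Unit = {
    // Four observations of two variables, all lying on the line y = 2x.
    val x: Matrix = Array(
      Array(1.0, 2.0), Array(2.0, 4.0), Array(3.0, 6.0), Array(4.0, 8.0))
    println(topComponent(x).mkString(", "))
  }
}
```

For the sample data in `main`, every observation lies on the line y = 2x, so the returned direction should be proportional to (1, 2), up to sign. Forming the n-by-n covariance matrix explicitly is only reasonable when n is small; that is one reason a sketch like this is not a substitute for the SVD-based implementation described above.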
# Testing
Tests included:
* All principal components
* Only top k principal components
# Documentation
# Example Usage
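{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.PCA
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry

// Load and parse the data file. `sc` is assumed to be an existing
// SparkContext (for example, the one provided by spark-shell).
// Each input line is parsed as three comma-separated fields: two integer
// indices and a double value.
val data = sc.textFile("mllib/data/als/test.data").map { line =>
  val parts = line.split(',')
  MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4  // number of rows (observations)
val n = 4  // number of columns (variables)
val k = 1  // number of principal components to keep

// Recover the top principal component
val coeffs = PCA.computePCA(SparseMatrix(data, m, n), k)
{% endhighlight %}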
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-spark pca
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-spark/pull/564.patch
----
commit 0642afb2ec1ca6896ffd1a4d3b12eca3f4db52b3
Author: Reza Zadeh <[email protected]>
Date: 2014-02-02T05:53:33Z
Initial files
commit 371f40ae288d45986c364adcfe4b584a9b00aa3d
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T01:50:59Z
new interfaces
commit 173148288dffe6cfa1d6671fa8dd9c57499fd0e8
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T04:04:46Z
add option to compute U
commit fb022fcc857bc3bbbb793882587480671b3e0b23
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T08:48:24Z
new tests, SVD interface
commit f756aff7b322504f09236f3ad4e05d4b75e8cc42
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T08:49:47Z
fix tests
commit 2d831f8f734ddf207707b721aa9718ebd7e65ca9
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T09:04:48Z
Documentation, yo
commit 31a5ecf977e6e4e6cd4d038aaa9f3d1ad1b3de49
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T09:15:23Z
added mllib guide docs
commit 57fe6d4ed9e214a504dbb2c5c66205045d5846b5
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T09:18:07Z
SparkPCA example
commit 07657476d3be2bd177090aaa37f6a4357329a188
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T09:22:15Z
fix typo
commit b45c1e88cb36ce2e5c78f493b05455f87ecfc662
Author: Reza Zadeh <[email protected]>
Date: 2014-02-08T09:23:15Z
fix example
----