[
https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tarek Elgamal updated SPARK-7856:
-
Description:
Currently the PCA implementation has the limitation that the d^2 entries of the
covariance/Gramian matrix must fit in memory (d is the number of
columns/dimensions of the matrix). We often need only the largest k principal
components. To make PCA truly scalable, I suggest an implementation whose
memory usage is proportional to the number of principal components k rather
than the full dimensionality d.
I suggest adopting the solution described in this paper published in
SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf).
The paper offers an implementation of Probabilistic PCA (PPCA), which has lower
memory and time complexity and could potentially scale to tall-and-fat matrices
rather than only the tall-and-skinny matrices supported by the current PCA
implementation.
Probabilistic PCA could potentially be added to the set of algorithms supported
by MLlib; it does not necessarily replace the old PCA implementation.
A PPCA implementation is also available in MATLAB's Statistics and Machine
Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html)
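To illustrate the memory argument (a NumPy sketch only, not MLlib code or the
paper's PPCA algorithm; the synthetic data, dimensions, and iteration count are
made up for demonstration), subspace iteration on X^T X keeps only a d x k
factor in memory instead of materializing the full d x d covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 50, 3  # n rows, d columns, k desired components

# Synthetic data with a clear rank-k signal plus small noise.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) * 3.0
X += rng.normal(size=(n, d)) * 0.1
X = X - X.mean(axis=0)  # center the columns

# Current approach: materializes a d x d matrix, i.e. d^2 entries in memory.
C = X.T @ X / n

# Subspace iteration: only an n x k and a d x k intermediate are ever
# materialized, so memory grows linearly in d for a fixed k.
Q = rng.normal(size=(d, k))
for _ in range(100):
    Z = X @ Q                      # n x k projection
    Q, _ = np.linalg.qr(X.T @ Z)   # d x k, re-orthonormalized each step

# Both approaches recover the same leading principal subspace: the principal
# angles between them (singular values of top.T @ Q) should all be ~1.
eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, -k:]
overlap = np.linalg.svd(top.T @ Q, compute_uv=False)
```

The same idea underlies the paper's approach: never form the d x d
covariance explicitly, so d can grow far beyond what d^2 memory allows.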
Scalable PCA implementation for tall and fat matrices
-
Key: SPARK-7856
URL: https://issues.apache.org/jira/browse/SPARK-7856
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Tarek Elgamal
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)