[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-7856:
-------------------------------------
    Issue Type: Improvement  (was: Bug)

> Scalable PCA implementation for tall and fat matrices
> -----------------------------------------------------
>
>                 Key: SPARK-7856
>                 URL: https://issues.apache.org/jira/browse/SPARK-7856
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Tarek Elgamal
>
> Currently the PCA implementation is limited by the need to fit the d^2
> entries of the covariance/Gramian matrix in memory (d is the number of
> columns/dimensions of the matrix). We often need only the largest k
> principal components. To make PCA really scalable, I suggest an
> implementation whose memory usage is proportional to the number of
> principal components k rather than to the full dimensionality d.
>
> I suggest adopting the solution described in this paper, published at
> SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf).
> The paper offers an implementation of Probabilistic PCA (PPCA) which has
> lower memory and time complexity and could potentially scale to tall and
> fat matrices, rather than only the tall and skinny matrices supported by
> the current PCA implementation.
>
> Probabilistic PCA could potentially be added to the set of algorithms
> supported by MLlib; it does not necessarily replace the old PCA
> implementation.
>
> A PPCA implementation is also available in Matlab's Statistics and Machine
> Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html)
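
To make the memory argument in the issue concrete, here is a rough back-of-the-envelope sketch. It is not taken from the issue or the paper: the sizes d and k are hypothetical, the 8-bytes-per-entry figure assumes dense double-precision storage, and the d x k footprint is only an assumed stand-in for a method that keeps k components in memory instead of the full d x d covariance/Gramian matrix.

```python
# Rough memory estimate, assuming dense storage of 8-byte doubles.
BYTES_PER_DOUBLE = 8


def covariance_bytes(d):
    """Bytes needed for a dense d x d covariance/Gramian matrix."""
    return d * d * BYTES_PER_DOUBLE


def components_bytes(d, k):
    """Bytes needed for a dense d x k matrix of k components.

    Assumption for illustration only: a k-component method is taken
    to need roughly this much memory for its main data structure.
    """
    return d * k * BYTES_PER_DOUBLE


# Hypothetical sizes: 100,000 columns, 50 principal components.
d, k = 100_000, 50
print(covariance_bytes(d) / 1e9, "GB for the d x d matrix")     # 80.0 GB
print(components_bytes(d, k) / 1e6, "MB for the d x k matrix")  # 40.0 MB
```

With these assumed sizes the d x d matrix alone would need about 80 GB, while a d x k matrix would need about 40 MB, which is the gap the proposal is aiming at for fat matrices.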