[jira] [Updated] (SPARK-7856) Scalable PCA implementation for tall and fat matrices

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-7856:
-
Issue Type: Improvement  (was: Bug)

 Scalable PCA implementation for tall and fat matrices
 -

 Key: SPARK-7856
 URL: https://issues.apache.org/jira/browse/SPARK-7856
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tarek Elgamal

 Currently the PCA implementation has the limitation that it must fit d^2 
 covariance/Gramian matrix entries in memory (d is the number of 
 columns/dimensions of the matrix). We often need only the largest k principal 
 components. To make PCA really scalable, I suggest an implementation whose 
 memory usage is proportional to the number of principal components k rather 
 than the full dimensionality d. 
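 To make the gap concrete, a back-of-the-envelope comparison (the values of d
 and k below are illustrative, not from this issue):

```python
# Memory needed (float64) for a dense d x d Gramian versus a d x k factor,
# which is the kind of low-rank state a k-proportional method would keep.
d = 100_000   # number of columns/dimensions (illustrative)
k = 10        # number of principal components needed (illustrative)

gramian_gb = d * d * 8 / 1e9   # full d x d covariance/Gramian matrix
factor_mb = d * k * 8 / 1e6    # a single d x k factor

print(f"d x d Gramian: {gramian_gb:.0f} GB")
print(f"d x k factor:  {factor_mb:.0f} MB")
```

 At these sizes the Gramian alone is 80 GB, while the d x k factor is 8 MB.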
 I suggest adopting the solution described in the paper published in 
 SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
 The paper offers an implementation of Probabilistic PCA (PPCA), which has 
 lower memory and time complexity and could potentially scale to tall and fat 
 matrices, rather than only the tall and skinny matrices supported by the 
 current PCA implementation. 
 Probabilistic PCA could potentially be added to the set of algorithms 
 supported by MLlib; it does not necessarily replace the old PCA 
 implementation.
 A PPCA implementation is also available in MATLAB's Statistics and Machine 
 Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html)
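 For readers unfamiliar with PPCA: the key property is that the EM iteration
 (Tipping & Bishop's formulation, which probabilistic PCA methods like the one
 proposed here build on) never materializes a d x d matrix; only k x k
 matrices are inverted. A minimal single-machine NumPy sketch of that
 iteration, purely illustrative and not the paper's distributed algorithm or
 any MLlib API, might look like:

```python
import numpy as np

def ppca(X, k, n_iter=50, seed=0):
    """EM for Probabilistic PCA (Tipping & Bishop).

    X: (n, d) data matrix. Returns factor W (d, k), mean mu (d,),
    and noise variance sigma2. Beyond the data itself, the working
    state is O(d*k) + O(k*k) -- no d x d matrix is ever formed.
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, k))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior over latent z_n; only a k x k inverse.
        M = W.T @ W + sigma2 * np.eye(k)          # (k, k)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                        # (n, k), rows are E[z_n]
        EzzT = n * sigma2 * Minv + Ez.T @ Ez      # sum_n E[z_n z_n^T], (k, k)
        # M-step: closed-form updates for W and sigma^2.
        W_new = (Xc.T @ Ez) @ np.linalg.inv(EzzT)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W_new) * Ez)
                  + np.trace(EzzT @ (W_new.T @ W_new))) / (n * d)
        W = W_new
    return W, mu, sigma2
```

 The columns of W span (approximately) the top-k principal subspace; an
 orthonormal basis can be recovered from W afterwards, e.g. via QR. In a
 distributed setting each E/M step reduces to matrix products against the
 row-partitioned data, which is what makes the approach attractive for Spark.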



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7856) Scalable PCA implementation for tall and fat matrices

2015-05-25 Thread Tarek Elgamal (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tarek Elgamal updated SPARK-7856:
-
Description: 
Currently the PCA implementation has the limitation that it must fit d^2 
covariance/Gramian matrix entries in memory (d is the number of 
columns/dimensions of the matrix). We often need only the largest k principal 
components. To make PCA really scalable, I suggest an implementation whose 
memory usage is proportional to the number of principal components k rather 
than the full dimensionality d. 

I suggest adopting the solution described in the paper published in SIGMOD 
2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
The paper offers an implementation of Probabilistic PCA (PPCA), which has 
lower memory and time complexity and could potentially scale to tall and fat 
matrices, rather than only the tall and skinny matrices supported by the 
current PCA implementation. 
Probabilistic PCA could potentially be added to the set of algorithms 
supported by MLlib; it does not necessarily replace the old PCA implementation.

A PPCA implementation is also available in MATLAB's Statistics and Machine 
Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html)

  was:
Currently the PCA implementation has the limitation that it must fit d^2 
covariance/Gramian matrix entries in memory (d is the number of 
columns/dimensions of the matrix). We often need only the largest k principal 
components. To make PCA really scalable, I suggest an implementation whose 
memory usage is proportional to the number of principal components k rather 
than the full dimensionality d. 

I suggest adopting the solution described in the paper published in SIGMOD 
2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
The paper offers an implementation of Probabilistic PCA (PPCA), which has 
lower memory and time complexity and could potentially scale to tall and fat 
matrices, rather than only the tall and skinny matrices supported by the 
current PCA implementation. 
Probabilistic PCA could potentially be added to the set of algorithms 
supported by MLlib; it does not necessarily replace the old PCA implementation.


 Scalable PCA implementation for tall and fat matrices
 -

 Key: SPARK-7856
 URL: https://issues.apache.org/jira/browse/SPARK-7856
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Tarek Elgamal



