[ 
https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001551#comment-16001551
 ] 

Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/9/17 2:00 AM:
--------------------------------------------------------------------------

Apart from implementing Probabilistic PCA (which in my view is a different 
algorithm, worth implementing separately from PCA), there are some issues with 
the current (2.3) implementation of 
RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA 
training.

In my opinion, the big problem with the current implementation is line 387 of 
RowMatrix.scala, which causes OutOfMemory exceptions for tall and fat 
matrices because it computes the covariance as a local dense Breeze matrix:

 val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]

This materializes the full d x d covariance (d being the number of columns) as 
a local dense Breeze matrix, which is needed neither for the principal 
components nor for the explained variance.
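
For a sense of scale, here is the arithmetic under assumed, purely 
illustrative sizes (d = 10^5 columns, k = 50 components, 8-byte doubles). 
None of these numbers come from the ticket:

```latex
\begin{align*}
\text{dense covariance: } & 8d^2 = 8 \times (10^5)^2 = 8 \times 10^{10}\ \text{bytes} \approx 80\ \text{GB} \\
\text{top-}k\text{ components: } & 8dk = 8 \times 10^5 \times 50 = 4 \times 10^{7}\ \text{bytes} = 40\ \text{MB}
\end{align*}
```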

In particular, RowMatrix already provides a more optimized SVD. The principal 
components and variance can therefore be derived from that decomposition by 
computing (X - µ).computeSVD(k, false, 0), where X - µ is the row matrix with 
the column means subtracted. This leads to a more scalable implementation of 
PCA for tall and fat matrices.
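
A rough sketch of what I mean, not a patch. It relies on the standard 
relation that if the centered matrix has SVD U S V^T over n rows, then the 
covariance is V (S^2 / (n - 1)) V^T, so the columns of V are the principal 
components and s_i^2 / (n - 1) are their variances. The helper name 
pcaViaSvd is hypothetical, and taking the total variance from the column 
statistics (so that the explained variance is a proportion) is my own 
assumption about how to normalize without the full spectrum:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): PCA from the SVD of the centered rows.
// Returns the d x k matrix of principal components and, for each component,
// the proportion of the total variance it explains.
def pcaViaSvd(rows: RDD[Vector], k: Int): (Matrix, Array[Double]) = {
  // One pass over the data for the column means and (unbiased) column variances.
  val summary = Statistics.colStats(rows)
  val n = summary.count
  val mean = summary.mean.toArray

  // Subtract the column means from every row. Caveat: this densifies sparse rows.
  val centered = rows.map { v =>
    val x = v.toArray
    Vectors.dense(Array.tabulate(x.length)(j => x(j) - mean(j)))
  }

  // Same call as in the comment: top-k singular values and V only, no U.
  val svd = new RowMatrix(centered).computeSVD(k, false, 0.0)

  // Total variance is the trace of the covariance, i.e. the sum of the
  // column variances, so the d x d covariance is never formed here.
  val totalVariance = summary.variance.toArray.sum
  val explained = svd.s.toArray.map(s => (s * s / (n - 1)) / totalVariance)

  (svd.V, explained)
}
```

Note that V is still a local d x k matrix, and computeSVD picks its own 
computation mode internally, so the memory behaviour for a given d and k 
would need to be checked rather than assumed.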

If this ticket is for the implementation of PPCA, it should be specified in the 
title.


> Scalable PCA implementation for tall and fat matrices
> -----------------------------------------------------
>
>                 Key: SPARK-7856
>                 URL: https://issues.apache.org/jira/browse/SPARK-7856
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Tarek Elgamal
>
> Currently the PCA implementation has a limitation of fitting d^2 
> covariance/Gramian matrix entries in memory (d is the number of 
> columns/dimensions of the matrix). We often need only the largest k principal 
> components. To make PCA really scalable, I suggest an implementation where 
> the memory usage is proportional to the number of principal components k 
> rather than the full dimensionality d. 
> I suggest adopting the solution described in this paper, published at 
> SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
> The paper offers an implementation of Probabilistic PCA (PPCA), which has 
> lower memory and time complexity and could potentially scale to tall and fat 
> matrices rather than only the tall and skinny matrices supported by the 
> current PCA implementation. 
> Probabilistic PCA could potentially be added to the set of algorithms 
> supported by MLlib; it does not necessarily replace the old PCA 
> implementation.
> A PPCA implementation is included in Matlab's Statistics and Machine Learning 
> Toolbox (http://www.mathworks.com/help/stats/ppca.html)


