GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/22784

    [SPARK-25790][MLLIB] PCA: Support more than 65535 column matrix

    ## What changes were proposed in this pull request?
    Spark PCA currently supports matrices with at most 65,535 columns. This is 
because it first computes the covariance matrix and then derives the principal 
components from it, and computing the **covariance matrix** is the main 
bottleneck. The limit comes from the JVM integer size: the covariance is passed 
to the Breeze library as a packed array of n * (n + 1) / 2 entries, and an 
array cannot be larger than INT_MAX, so the maximum number of columns we can 
support is 65,535.
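    
    For concreteness, here is a minimal sketch of the arithmetic (assuming the 
packed upper-triangular layout described above) showing why 65,535 is the 
largest column count whose packed covariance still fits in a JVM array:
    
    val limit = Int.MaxValue.toLong        // 2147483647, max JVM array size
    val fits  = 65535L * (65535L + 1) / 2  // 2147450880, still fits
    val over  = 65536L * (65536L + 1) / 2  // 2147516416, exceeds Int.MaxValue
    println(s"n=65535 fits: ${fits <= limit}, n=65536 fits: ${over <= limit}")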
    
    Spark does not have this limitation when computing an SVD. So, when the 
number of columns exceeds the limit, we can use Spark's SVD to compute the PCA 
instead.
    
    The principal components can be computed directly from the SVD of the 
mean-centered data matrix, without ever forming the covariance matrix (see the 
sketch below).
    The following papers/links are given for reference.
    
    
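    As an illustration only (a sketch of the SVD route, not the code in this 
patch; `pcaViaSVD` is a hypothetical helper name), PCA via the SVD of the 
mean-centered rows:
    
    import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD
    import breeze.linalg.{DenseVector => BDV}
    
    // For a mean-centered data matrix A = U * S * V^t, the columns of V are
    // the principal components; the component variances are s_i^2 / (m - 1).
    def pcaViaSVD(rows: RDD[Vector], k: Int): Matrix = {
      // Column means, computed in one distributed pass.
      val mean = BDV(Statistics.colStats(rows).mean.toArray)
      // Center every row; the n x n covariance matrix is never materialized,
      // so the n * (n + 1) / 2 array-size limit does not apply.
      val centered =
        rows.map(v => Vectors.dense((BDV(v.toArray) - mean).toArray))
      val svd = new RowMatrix(centered).computeSVD(k, computeU = false)
      svd.V // n x k matrix holding the top-k principal components as columns
    }
    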
    ## How was this patch tested?
    Added a unit test; also verified manually against the existing PCA tests 
by removing the limit condition in the fit method.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark PCA

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22784.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22784
    
----
commit 042b546c345dcce7d139b16eaf2378ffc556134f
Author: Shahid <shahidki31@...>
Date:   2018-10-20T18:24:49Z

    PCA: number of columns more than 65500

----

