GitHub user shahidki31 opened a pull request: https://github.com/apache/spark/pull/22784
[SPARK-25790][MLLIB] PCA: Support more than 65535 column matrix

## What changes were proposed in this pull request?

Spark PCA currently supports a matrix with at most ~65,535 columns. This is because it first computes the covariance matrix and then derives the principal components from it, and computing the **covariance matrix** is the main bottleneck. The ~65,500 limit comes from the integer size limit: we pass an array of size n*(n+1)/2 to the Breeze library, and that size cannot exceed INT_MAX (for n ≈ 65,535, n*(n+1)/2 ≈ 2.1 × 10^9, right at the Int.MaxValue boundary), so the maximum number of columns we can handle is about 65,500.

There is no such limitation when computing SVD in Spark, so we can use Spark's SVD to compute the PCA when the number of columns exceeds the limit. PCA can be computed directly from the SVD of the matrix, instead of first forming the covariance matrix. Following are the papers/links for the reference.

## How was this patch tested?

Added a UT, and also manually verified it against the existing PCA test by removing the limit condition in the fit method.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark PCA

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22784.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22784

----

commit 042b546c345dcce7d139b16eaf2378ffc556134f
Author: Shahid <shahidki31@...>
Date: 2018-10-20T18:24:49Z

    PCA: number of columns more than 65500

----
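To illustrate the idea behind the change, here is a minimal Scala sketch (not the code from this PR; `pcaViaSvd` is a hypothetical helper) of computing principal components from the SVD of a mean-centered RowMatrix, which avoids materialising the n*(n+1)/2 covariance array entirely:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Sketch only: for a mean-centered matrix A, the right singular vectors V of A
// are the principal components, so PCA can be obtained from SVD without ever
// forming the n x n covariance matrix.
def pcaViaSvd(rows: RDD[Vector], k: Int): Matrix = {
  // Column means, used to center the data so that A^T A is proportional to the covariance.
  val mean = new RowMatrix(rows).computeColumnSummaryStatistics().mean.toArray

  // Subtract the column means from every row.
  val centered = rows.map { v =>
    val arr = v.toArray.clone()
    var i = 0
    while (i < arr.length) { arr(i) -= mean(i); i += 1 }
    Vectors.dense(arr)
  }

  // The right singular vectors of the centered matrix are the top-k principal components.
  new RowMatrix(centered).computeSVD(k, computeU = false).V
}
```

Compared to `RowMatrix.computePrincipalComponents`, which builds the dense covariance array, this path only relies on `computeSVD`, so the column count is not bounded by the INT_MAX array-size limit described above.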