[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030954#comment-14030954 ]
Li Pu commented on SPARK-1782: ------------------------------ PR: https://github.com/apache/spark/pull/964 > svd for sparse matrix using ARPACK > ---------------------------------- > > Key: SPARK-1782 > URL: https://issues.apache.org/jira/browse/SPARK-1782 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Li Pu > Original Estimate: 672h > Remaining Estimate: 672h > > Currently the svd implementation in mllib calls the dense matrix svd in > breeze, which has a limitation of fitting n^2 Gram matrix entries in memory > (n is the number of rows or number of columns of the matrix, whichever is > smaller). In many use cases, the original matrix is sparse but the Gram > matrix might not, and we often need only the largest k singular > values/vectors. To make svd really scalable, the memory usage must be > propositional to the non-zero entries in the matrix. > One solution is to call the de facto standard eigen-decomposition package > ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors > of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the > eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a > reverse communication interface. The user provides a function to multiply a > square matrix to be decomposed with a dense vector provided by ARPACK, and > return the resulting dense vector to ARPACK. Inside ARPACK it uses an > Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we > need to provide are two matrix-vector multiplications, first M*x then M^t*x. > These multiplications can be done in Spark in a distributed manner. > The working memory used by ARPACK is O(n*k). When k (the number of desired > singular values) is small, it can be easily fit into the memory of the master > machine. The overall model is master machine runs ARPACK, and distribute > matrix-vector multiplication onto working executors in each iteration. > I made a PR to breeze with an ARPACK-backed svds interface > (https://github.com/scalanlp/breeze/pull/240). The interface takes anything > that can be multiplied by a DenseVector. On Spark/milib side, just need to > implement the sparsematrix-vector multiplication. > It might take some time to optimize and fully test this implementation, so > set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)