[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000620#comment-14000620
 ] 

Xiangrui Meng commented on SPARK-1782:
--------------------------------------

If you need the the latest Breeze to use eigs, I would prefer calling ARPACK 
directly.

> svd for sparse matrix using ARPACK
> ----------------------------------
>
>                 Key: SPARK-1782
>                 URL: https://issues.apache.org/jira/browse/SPARK-1782
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Li Pu
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Currently the svd implementation in mllib calls the dense matrix svd in 
> breeze, which has a limitation of fitting n^2 Gram matrix entries in memory 
> (n is the number of rows or number of columns of the matrix, whichever is 
> smaller). In many use cases, the original matrix is sparse but the Gram 
> matrix might not, and we often need only the largest k singular 
> values/vectors. To make svd really scalable, the memory usage must be 
> propositional to the non-zero entries in the matrix. 
> One solution is to call the de facto standard eigen-decomposition package 
> ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
> of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
> eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
> reverse communication interface. The user provides a function to multiply a 
> square matrix to be decomposed with a dense vector provided by ARPACK, and 
> return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
> Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we 
> need to provide are two matrix-vector multiplications, first M*x then M^t*x. 
> These multiplications can be done in Spark in a distributed manner.
> The working memory used by ARPACK is O(n*k). When k (the number of desired 
> singular values) is small, it can be easily fit into the memory of the master 
> machine. The overall model is master machine runs ARPACK, and distribute 
> matrix-vector multiplication onto working executors in each iteration. 
> I made a PR to breeze with an ARPACK-backed svds interface 
> (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
> that can be multiplied by a DenseVector. On Spark/milib side, just need to 
> implement the sparsematrix-vector multiplication. 
> It might take some time to optimize and fully test this implementation, so 
> set the workload estimate to 4 weeks. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to