[
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng reassigned SPARK-1782:
------------------------------------
Assignee: Xiangrui Meng
> svd for sparse matrix using ARPACK
> ----------------------------------
>
> Key: SPARK-1782
> URL: https://issues.apache.org/jira/browse/SPARK-1782
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Li Pu
> Assignee: Xiangrui Meng
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Currently the svd implementation in mllib calls the dense matrix svd in
> breeze, which has a limitation of fitting n^2 Gram matrix entries in memory
> (n is the number of rows or number of columns of the matrix, whichever is
> smaller). In many use cases, the original matrix is sparse but the Gram
> matrix might not, and we often need only the largest k singular
> values/vectors. To make svd really scalable, the memory usage must be
> propositional to the non-zero entries in the matrix.
> One solution is to call the de facto standard eigen-decomposition package
> ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors
> of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the
> eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a
> reverse communication interface. The user provides a function to multiply a
> square matrix to be decomposed with a dense vector provided by ARPACK, and
> return the resulting dense vector to ARPACK. Inside ARPACK it uses an
> Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we
> need to provide are two matrix-vector multiplications, first M*x then M^t*x.
> These multiplications can be done in Spark in a distributed manner.
> The working memory used by ARPACK is O(n*k). When k (the number of desired
> singular values) is small, it can be easily fit into the memory of the master
> machine. The overall model is master machine runs ARPACK, and distribute
> matrix-vector multiplication onto working executors in each iteration.
> I made a PR to breeze with an ARPACK-backed svds interface
> (https://github.com/scalanlp/breeze/pull/240). The interface takes anything
> that can be multiplied by a DenseVector. On Spark/milib side, just need to
> implement the sparsematrix-vector multiplication.
> It might take some time to optimize and fully test this implementation, so
> set the workload estimate to 4 weeks.
--
This message was sent by Atlassian JIRA
(v6.2#6252)