[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-08-19 Thread Tarek Elgamal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102179#comment-14102179
 ] 

Tarek Elgamal commented on SPARK-1782:
--

I am interested in trying this new svd implementation. Is there an estimate of 
when Spark 1.1.0 will be officially released?

 svd for sparse matrix using ARPACK
 --

 Key: SPARK-1782
 URL: https://issues.apache.org/jira/browse/SPARK-1782
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Li Pu
 Fix For: 1.1.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently the svd implementation in mllib calls the dense matrix svd in 
 breeze, which has the limitation of fitting n^2 Gram matrix entries in memory 
 (n is the number of rows or the number of columns of the matrix, whichever is 
 smaller). In many use cases the original matrix is sparse but the Gram 
 matrix might not be, and we often need only the largest k singular 
 values/vectors. To make svd really scalable, the memory usage must be 
 proportional to the number of non-zero entries in the matrix. 
 One solution is to call the de facto standard eigen-decomposition package 
 ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
 of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
 eigenvalues/vectors to reconstruct the singular values/vectors. ARPACK has a 
 reverse communication interface: the user provides a function that multiplies 
 the square matrix to be decomposed by a dense vector supplied by ARPACK, and 
 returns the resulting dense vector to ARPACK. Internally, ARPACK uses an 
 Implicitly Restarted Lanczos Method for symmetric matrices. Externally, all 
 we need to provide are two matrix-vector multiplications, first M*x and then 
 M^t*x. These multiplications can be done in Spark in a distributed manner, as 
 sketched below. 
 The working memory used by ARPACK is O(n*k). When k (the number of desired 
 singular values) is small, it easily fits in the memory of the master 
 machine. The overall model is that the master machine runs ARPACK and 
 distributes the matrix-vector multiplications to worker executors in each 
 iteration. 
 I made a PR to breeze with an ARPACK-backed svds interface 
 (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
 that can be multiplied by a DenseVector. On the Spark/mllib side, we just 
 need to implement the sparse matrix-vector multiplication. 
 It might take some time to optimize and fully test this implementation, so I 
 set the workload estimate to 4 weeks. 
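As a rough illustration of those two distributed multiplications on the Spark 
side (a minimal sketch; the names, types, and helpers are hypothetical, not 
the MLlib or Breeze API):

{code:scala}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
import org.apache.spark.rdd.RDD

// `rows` holds the rows of an m-by-n matrix M as (rowIndex, sparseRow)
// pairs; x and y are dense vectors held on the driver.

// y = M * x : each distributed row contributes one entry of the result.
def multiply(rows: RDD[(Long, BSV[Double])], x: BDV[Double], m: Int): BDV[Double] = {
  val bx = rows.sparkContext.broadcast(x)
  val y = BDV.zeros[Double](m)
  rows.map { case (i, a) => (i, a dot bx.value) }
      .collect()
      .foreach { case (i, v) => y(i.toInt) = v }
  y
}

// z = M^t * y : a weighted sum of the rows of M.
def multiplyTranspose(rows: RDD[(Long, BSV[Double])], y: BDV[Double], n: Int): BDV[Double] = {
  val by = rows.sparkContext.broadcast(y)
  rows.aggregate(BDV.zeros[Double](n))(
    (acc, row) => acc += (row._2 * by.value(row._1.toInt)).toDenseVector,
    (acc1, acc2) => acc1 += acc2
  )
}
{code}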






[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-08-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102427#comment-14102427
 ] 

Xiangrui Meng commented on SPARK-1782:
--

The plan is to release v1.1 by the end of the month. The feature is available 
in both master and branch-1.1. You can also check out the current snapshot and 
give it a try.




[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-06-13 Thread Li Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030954#comment-14030954
 ] 

Li Pu commented on SPARK-1782:
--

PR: https://github.com/apache/spark/pull/964



[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999499#comment-13999499
 ] 

Xiangrui Meng commented on SPARK-1782:
--

Btw, this approach only gives us \Sigma and V. If we compute U via A V 
\Sigma^{-1} (the current implementation in MLlib), we very likely lose 
orthogonality. For sparse SVD, the best package is PROPACK, which implements 
Lanczos bidiagonalization with partial reorthogonalization 
(http://sun.stanford.edu/~rmunk/PROPACK/). But let us use ARPACK for now, 
since we can call it from Breeze.
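
For reference, restating the reconstruction in symbols (standard SVD algebra, 
matching the recipe in the issue description): if (\lambda_i, v_i) are the top 
eigenpairs of A^T A returned by ARPACK, then

  A^T A v_i = \lambda_i v_i,   \sigma_i = \sqrt{\lambda_i},   u_i = A v_i / \sigma_i,

and in finite precision the computed u_i can drift away from mutual 
orthogonality, especially when the trailing \sigma_i are small or clustered.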



[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Li Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000232#comment-14000232
 ] 

Li Pu commented on SPARK-1782:
--

[~mengxr] thank you for the comments! You are right, (A^T A) x can be done in a 
single pass. What we need is actually an eigs solver. It seems that mllib 
depends on Breeze 0.7 (though I cannot find that release version in 
scalanlp/breeze). I think the code would be cleaner if we called eigs in 
Breeze; or do you prefer to have more control in mllib by calling ARPACK 
directly?

Also, thank you for the pointer to PROPACK. I looked into its svd routine: it 
calls LAPACK's dbdsqr for the actual svd computation. I will try to find a 
better way to incorporate Fortran routines in a distributed setting. For now 
we can use ARPACK.



[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999493#comment-13999493
 ] 

Xiangrui Meng commented on SPARK-1782:
--

This sounds good to me. Let's assume that A is m-by-n where n is relatively 
small (on the order of 1e6). Note that (A^T A) x = (A^T A)^T x can be computed 
in a single pass as sum_i (a_i^T x) a_i, where a_i is the i-th row of A. So we 
don't need to implement A^T y separately, which simplifies the task.
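
A minimal Spark sketch of this single-pass Gram product (illustrative names 
and types, not MLlib's API): each row contributes a_i (a_i^T x), and the 
partial sums are merged on the way back to the driver.

{code:scala}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
import org.apache.spark.rdd.RDD

// Returns (A^T A) x, where `rows` holds the rows a_i of the m-by-n
// matrix A and x is a dense n-vector from the driver.
def multiplyGramianBy(rows: RDD[BSV[Double]], x: BDV[Double], n: Int): BDV[Double] = {
  val bx = rows.sparkContext.broadcast(x)
  rows.aggregate(BDV.zeros[Double](n))(
    (acc, a) => acc += (a * (a dot bx.value)).toDenseVector, // acc += a_i (a_i^T x)
    (acc1, acc2) => acc1 += acc2                             // merge partial sums
  )
}
{code}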



[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000620#comment-14000620
 ] 

Xiangrui Meng commented on SPARK-1782:
--

If you need the latest Breeze to use eigs, I would prefer calling ARPACK 
directly.
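
For what calling ARPACK directly might look like, here is a minimal sketch of 
the reverse communication loop through the netlib-java binding (workspace 
sizes follow the dsaupd documentation; `mul` stands for the caller-supplied 
distributed multiply y = (M^t M) x; error handling and the final dseupd 
extraction are omitted):

{code:scala}
import breeze.linalg.{DenseVector => BDV}
import com.github.fommil.netlib.ARPACK
import org.netlib.util.{doubleW, intW}

def symmetricEigs(mul: BDV[Double] => BDV[Double], n: Int, k: Int,
                  tol: Double, maxIter: Int): Unit = {
  val arpack = ARPACK.getInstance()
  val tolW = new doubleW(tol)
  val nev = new intW(k)
  val ncv = math.min(2 * k, n)      // number of Lanczos basis vectors
  val iparam = new Array[Int](11)
  iparam(0) = 1                     // use exact shifts
  iparam(2) = maxIter               // max Lanczos iterations
  iparam(6) = 1                     // mode 1: standard eigenvalue problem
  val ido = new intW(0)
  val info = new intW(0)
  val resid = new Array[Double](n)
  val v = new Array[Double](n * ncv)
  val workd = new Array[Double](3 * n)
  val workl = new Array[Double](ncv * (ncv + 8))
  val ipntr = new Array[Int](11)

  arpack.dsaupd(ido, "I", n, "LM", nev.`val`, tolW, resid, ncv, v, n,
                iparam, ipntr, workd, workl, workl.length, info)
  while (ido.`val` != 99) {         // ido == 99 signals termination
    // ARPACK asks us to apply the operator: read x out of workd,
    // compute y = (M^t M) x, and write y back into workd.
    val inOff = ipntr(0) - 1        // pointers are Fortran 1-based
    val outOff = ipntr(1) - 1
    val x = BDV(workd.slice(inOff, inOff + n))
    val y = mul(x)
    System.arraycopy(y.data, 0, workd, outOff, n)
    arpack.dsaupd(ido, "I", n, "LM", nev.`val`, tolW, resid, ncv, v, n,
                  iparam, ipntr, workd, workl, workl.length, info)
  }
  // A final arpack.dseupd(...) call extracts the converged eigenvalues
  // and eigenvectors from v (omitted here).
}
{code}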




--
This message was sent by Atlassian JIRA
(v6.2#6252)