matrix computation in spark
Hi,

Matrix computation is critical to the efficiency of algorithms such as least squares and the Kalman filter. For now, the mllib module offers only limited linear algebra on matrices, especially distributed matrices. We have been working on distributed matrix computation APIs built on the data structures in MLlib. The main idea is to partition the matrix into sub-blocks, based on the strategy in the following paper: http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf

In our experiments it is communication-optimal, but operations like factorization may not be appropriate to carry out in blocks. Any suggestions and guidance are welcome.

Thanks,
Yuxi
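The sub-block partitioning idea can be sketched locally. The following is a minimal NumPy illustration (not MLlib's actual API, and with an illustrative fixed block size): each sub-block product `A[i,l] @ B[l,j]` is an independent task that, in a distributed setting, would run on a worker, with partial results summed by output-block key.

```python
import numpy as np

def block_multiply(A, B, block=2):
    """Multiply A and B by iterating over sub-blocks of the given size.

    Each (i, j, l) iteration multiplies one pair of sub-blocks; these
    tasks are independent, which is what makes the scheme distributable.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, block):          # row blocks of A / C
        for j in range(0, m, block):      # column blocks of B / C
            for l in range(0, k, block):  # inner-dimension blocks
                C[i:i+block, j:j+block] += (
                    A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]
                )
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
print(np.allclose(block_multiply(A, B), A @ B))  # True
```

The communication-optimal CARMA strategy from the paper chooses how to split the three loop dimensions based on matrix shapes and memory; the fixed blocking above only shows the task decomposition itself.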
Re: matrix computation in spark
There's been some work at the AMPLab on a distributed matrix library on top of Spark; see [1]. In particular, the repo contains a couple of factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng
Re: matrix computation in spark
Hey Yuxi,

We have also implemented a distributed matrix multiplication library at PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin. We implemented three distributed matrix multiplication algorithms on Spark. As we see it, communication-optimal does not always mean total-optimal. Thus, besides the CARMA matrix multiplication you mentioned, we also implemented block-splitting matrix multiplication and broadcast matrix multiplication. They are more efficient than CARMA matrix multiplication in some situations, for example when a large matrix multiplies a small matrix.

We shared this work at the Spark Meetup@Beijing on October 26th (http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/). The slides can be downloaded from the archive here: http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd

Best,
Rong

--
Rong Gu
Department of Computer Science and Technology
State Key Laboratory for Novel Software Technology
Nanjing University
Phone: +86 15850682791
Email: gurongwal...@gmail.com
Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
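The broadcast strategy mentioned above (efficient when one matrix is small) can be sketched locally with NumPy standing in for RDD partitions. Names and the partitioning are illustrative, not Marlin's actual API: every "worker" holds one row block of A and receives a full copy of the small matrix B, so no shuffle of B's blocks is needed.

```python
import numpy as np

def broadcast_multiply(A_row_blocks, B_small):
    """Multiply a row-partitioned A by a small B that every worker holds.

    In Spark this corresponds roughly to broadcasting B_small and mapping
    each row block to block @ B_small, then stacking the results.
    """
    return np.vstack([blk @ B_small for blk in A_row_blocks])

A = np.arange(24.0).reshape(6, 4)
B = np.arange(8.0).reshape(4, 2)   # the "small" right-hand matrix
row_blocks = [A[0:3], A[3:6]]      # two partitions of A's rows
C = broadcast_multiply(row_blocks, B)
print(np.allclose(C, A @ B))  # True
```

This avoids the all-to-all communication of block-splitting schemes at the cost of replicating B on every worker, which is why it only pays off when B is small.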
Re: matrix computation in spark
Hi,

I checked the work in ml-matrix. For now, it doesn't include matrix multiplication or LU decomposition. What's your plan? Can we contribute our work on these parts? Also, the number of row/column blocks is currently decided manually; as we mentioned, the CARMA method in the paper is communication-optimal.

From: Zongheng Yang [mailto:zonghen...@gmail.com]
Sent: November 18, 2014 11:37
To: liaoyuxi; d...@spark.incubator.apache.org
Cc: Shivaram Venkataraman
Subject: Re: matrix computation in spark
Re: matrix computation in spark
Hi Yuxi,

We are integrating ml-matrix from the AMPLab repo into MLlib, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434

We already have matrix multiply, but are missing LU decomposition. Could you please watch that JIRA? Once the initial design is in, we can sync on how to contribute LU decomposition. Let's move the discussion to the JIRA. Thanks!
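On the earlier point that factorizations like LU are harder to carry out in blocks than multiplication: unlike block multiplication, LU has a sequential dependency between panels, since the trailing submatrix must be updated before the next step can be factored. A minimal Doolittle LU without pivoting (illustrative only, not any library's implementation) makes the dependency visible:

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU factorization without pivoting: A = L @ U.

    The loop over k is inherently sequential: step k's trailing update
    must complete before step k+1 can be factored, which is why LU does
    not decompose into independent block tasks the way multiplication does.
    """
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):                     # step k must finish first
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]     # trailing-submatrix update
    return L, np.triu(U)

A = np.array([[4.0, 3.0], [6.0, 3.0]])
L, U = lu_nopivot(A)
print(np.allclose(L @ U, A))  # True
```

Distributed blocked LU exists (panel factorization plus parallel trailing updates, as in ScaLAPACK-style algorithms), but the panel-by-panel dependency is what makes it a harder fit for a block-task model than multiplication.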