GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/575
[Proposal] Adding sparse data support and update KMeans

This is a proposal for sparse data support in mllib (https://spark-project.atlassian.net/browse/MLLIB-18). The idea is to define simple data models and factory methods that let users provide sparse input. Then, instead of writing a linear algebra library for mllib, we leverage an existing linear algebra package to implement the algorithms. This way we can change the underlying implementation in the future without breaking the interfaces.

We need the following:

* Data models for sparse vectors. We need data models for dense vectors, sequential access sparse vectors (backed by two parallel arrays), and random access sparse vectors (backed by a primitive-typed hash map; not in this pull request). Those are defined in the Vec class in this PR.

* A linear algebra package. We are considering either breeze or mahout-math. Both have pros and cons, and we can discuss more in the JIRA. This PR uses mahout-math. Mahout vectors do not implement Serializable, so we need a serializable wrapper class (MahoutVectorWrapper) to use them in Spark. As a result, we added not only mahout-math but also mahout-core to the dependencies, because the wrapper class needs VectorWritable, which is defined in mahout-core. But we can certainly remove most transitive dependencies of mahout-core.

* Lightweight converters. The conversion between our data models and Mahout vectors shouldn't involve data copying. However, Mahout vectors hide their members. In this PR, Java reflection is used to get the private fields out. This doesn't seem to be a good solution, but I didn't figure out a better one.
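As a rough illustration of the sequential access sparse vector and the reflection workaround described above, here is a minimal Java sketch. It is not the code in this PR: SparseVecSketch, OpaqueVector, and ReflectionSketch are hypothetical names, and OpaqueVector only stands in for a library vector that hides its storage (it is not a Mahout class, and the field name "values" is an assumption made for the example).

```java
import java.lang.reflect.Field;

/** Hypothetical sequential access sparse vector backed by two parallel arrays. */
final class SparseVecSketch {
    final int size;        // logical dimension of the vector
    final int[] indices;   // strictly increasing positions of the stored entries
    final double[] values; // values[k] is the entry at position indices[k]

    SparseVecSketch(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        for (int k = 0; k < indices.length; k++) {
            boolean inRange = indices[k] >= 0 && indices[k] < size;
            boolean increasing = k == 0 || indices[k - 1] < indices[k];
            if (!inRange || !increasing) {
                throw new IllegalArgumentException("indices must be strictly increasing and within [0, size)");
            }
        }
        this.size = size;
        this.indices = indices; // kept by reference, no copy
        this.values = values;   // kept by reference, no copy
    }

    /** Squared Euclidean distance, the quantity a KMeans assignment step compares. */
    static double squaredDistance(SparseVecSketch a, SparseVecSketch b) {
        if (a.size != b.size) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        int i = 0;
        int j = 0;
        // Merge the two sorted index arrays; an entry missing on one side counts as zero.
        while (i < a.indices.length || j < b.indices.length) {
            double d;
            if (j == b.indices.length || (i < a.indices.length && a.indices[i] < b.indices[j])) {
                d = a.values[i++];                   // entry present only in a
            } else if (i == a.indices.length || b.indices[j] < a.indices[i]) {
                d = b.values[j++];                   // entry present only in b
            } else {
                d = a.values[i++] - b.values[j++];   // entry present in both
            }
            sum += d * d;
        }
        return sum;
    }
}

/** Hypothetical stand-in for a third-party vector that hides its storage. */
final class OpaqueVector {
    private final double[] values;

    OpaqueVector(double[] values) {
        this.values = values;
    }
}

/** Reads the private backing array via reflection instead of copying the data. */
final class ReflectionSketch {
    static double[] backingArray(OpaqueVector v) {
        try {
            // "values" is the assumed private field name of the stand-in class.
            Field f = OpaqueVector.class.getDeclaredField("values");
            f.setAccessible(true);
            return (double[]) f.get(v);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("backing field not accessible", e);
        }
    }
}
```

The sketch shows the trade-off mentioned in the last bullet: reflection avoids copying because it returns the same array instance the wrapped object holds, but it depends on a private field name that the library is free to change.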
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-spark sparse

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/575.patch

----
commit 37c423e746e26d9c3db23580df34a959ddf8fe44
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:33:42Z

    add mahout-math and mahout-core to mllib

commit 6d3fda1b07683824719330243cb07275d5537f76
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:44:13Z

    add MahoutVectorWrapper

commit ff2c072b712479f2f664a4f3828b85f46c46128e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:47:35Z

    add implicit conversions

commit ee53e096bb101571a559df1d78c8091bb3ba4b0a
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T09:38:12Z

    add Vec and MahoutVectorHelper

commit 92be705a9c317308554d23105e6fc747764e6568
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:04:01Z

    update LocalKMeans

commit d56239954fc41482a86993896ca9229c6dbb0756
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:15:02Z

    use mahout in KMeans

commit 38f0cc6058c88628827c1513845de04eed5da69e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:41:58Z

    add Vec interface to KMeansModel
    add a sparse test to KMeansSuite

commit 686fb79872421ff5cbc0e083051fae78c79186aa
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:49:20Z

    add headers and docs

commit 273af590eb8c9b720cc5dffa1b1e03447c93362e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:55:04Z

    add VecSuite

commit b7e06c86f65830d8704977960698e13bc4d06070
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T11:00:40Z

    remove default constructor from MahoutVectorWrapper

----