GitHub user mengxr opened a pull request:
https://github.com/apache/incubator-spark/pull/575
[Proposal] Adding sparse data support and update KMeans
This is a proposal for sparse data support in mllib
(https://spark-project.atlassian.net/browse/MLLIB-18).
The idea of the proposal is that we define simple data models and factory
methods for users to provide sparse input. Then, instead of writing a linear
algebra library for mllib, we leverage an existing linear algebra package to
implement the algorithms. This way we can change the underlying
implementation in the future without breaking the interfaces. We need the
following:
* data models for sparse vectors. We need data models for dense vectors,
sequential-access sparse vectors (backed by two parallel arrays), and
random-access sparse vectors (backed by a primitive-typed hash map, not
included in this pull request). Those are defined in the Vec class in this PR.
* a linear algebra package. We are considering either breeze or
mahout-math. Both have pros and cons, and we can discuss more in the JIRA. This
PR uses mahout-math. Mahout vectors do not implement Serializable, so we need a
serializable wrapper class (MahoutVectorWrapper) to use them in Spark. As a
result, we added not only mahout-math but also mahout-core to the dependencies,
because the wrapper class needs VectorWritable, which is defined in
mahout-core. But we can certainly remove most transitive dependencies of
mahout-core.
* lightweight converters. The conversion between our data models and Mahout
vectors shouldn't involve data copying. However, Mahout vectors hide their
members. In this PR, Java reflection is used to get the private fields out.
This doesn't seem to be a good solution, but I haven't found a better one.
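
To make the data-model and converter points above concrete, here is a minimal
Java sketch, not the code in this PR. It assumes a hypothetical library class
(HiddenSparseVector) that keeps its storage private, and a hypothetical data
model (SparseVec) backed by two parallel arrays. The field names "size",
"indices", and "values" are made up for illustration and are not Mahout's
actual field names. Reflection pulls out the private arrays so that the
wrapper shares them instead of copying.

```java
import java.io.Serializable;
import java.lang.reflect.Field;

public class SparseVecSketch {

    /** Hypothetical stand-in for a library vector that keeps its storage private. */
    static final class HiddenSparseVector {
        private final int size;
        private final int[] indices;
        private final double[] values;

        HiddenSparseVector(int size, int[] indices, double[] values) {
            this.size = size;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Hypothetical data model: indices and values stored in two parallel arrays. */
    static final class SparseVec implements Serializable {
        private static final long serialVersionUID = 1L;
        final int size;
        final int[] indices;
        final double[] values;

        SparseVec(int size, int[] indices, double[] values) {
            if (indices.length != values.length) {
                throw new IllegalArgumentException("indices and values must have equal length");
            }
            this.size = size;
            this.indices = indices;
            this.values = values;
        }

        /** Dot product with a dense array; only the stored entries are visited. */
        double dot(double[] dense) {
            if (dense.length != size) {
                throw new IllegalArgumentException("dimension mismatch");
            }
            double sum = 0.0;
            for (int k = 0; k < indices.length; k++) {
                sum += values[k] * dense[indices[k]];
            }
            return sum;
        }
    }

    /** Reads a private field by name; failures surface as unchecked exceptions. */
    static Object readPrivateField(Object target, String name) {
        try {
            Field field = target.getClass().getDeclaredField(name);
            field.setAccessible(true); // bypass the private modifier
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read field " + name, e);
        }
    }

    /** Wraps the hidden arrays directly, so no element data is copied. */
    static SparseVec wrapWithoutCopy(HiddenSparseVector v) {
        int size = (Integer) readPrivateField(v, "size");
        int[] indices = (int[]) readPrivateField(v, "indices");
        double[] values = (double[]) readPrivateField(v, "values");
        return new SparseVec(size, indices, values);
    }

    public static void main(String[] args) {
        int[] idx = {1, 3};
        double[] vals = {2.0, 4.0};
        HiddenSparseVector hidden = new HiddenSparseVector(5, idx, vals);
        SparseVec vec = wrapWithoutCopy(hidden);
        System.out.println("shares storage: " + (vec.values == vals));
        System.out.println("dot: " + vec.dot(new double[] {1, 2, 3, 4, 5}));
    }
}
```

The sketch also shows why the reflection approach is fragile: the converter
depends on private field names, which a library can change without notice.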
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-spark sparse
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-spark/pull/575.patch
----
commit 37c423e746e26d9c3db23580df34a959ddf8fe44
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T08:33:42Z
add mahout-math and mahout-core to mllib
commit 6d3fda1b07683824719330243cb07275d5537f76
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T08:44:13Z
add MahoutVectorWrapper
commit ff2c072b712479f2f664a4f3828b85f46c46128e
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T08:47:35Z
add implicit conversions
commit ee53e096bb101571a559df1d78c8091bb3ba4b0a
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T09:38:12Z
add Vec and MahoutVectorHelper
commit 92be705a9c317308554d23105e6fc747764e6568
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T10:04:01Z
update LocalKMeans
commit d56239954fc41482a86993896ca9229c6dbb0756
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T10:15:02Z
use mahout in KMeans
commit 38f0cc6058c88628827c1513845de04eed5da69e
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T10:41:58Z
add Vec interface to KMeansModel
add a sparse test to KMeansSuite
commit 686fb79872421ff5cbc0e083051fae78c79186aa
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T10:49:20Z
add headers and docs
commit 273af590eb8c9b720cc5dffa1b1e03447c93362e
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T10:55:04Z
add VecSuite
commit b7e06c86f65830d8704977960698e13bc4d06070
Author: Xiangrui Meng <[email protected]>
Date: 2014-02-10T11:00:40Z
remove default constructor from MahoutVectorWrapper
----