GitHub user mengxr opened a pull request:

    https://github.com/apache/incubator-spark/pull/575

    [Proposal] Adding sparse data support and update KMeans

    This is a proposal for sparse data support in mllib 
(https://spark-project.atlassian.net/browse/MLLIB-18). 
    
    The idea of the proposal is that we define simple data models and factory 
methods for users to provide sparse input. Then, instead of writing a linear 
algebra library for mllib, we leverage an existing linear algebra package to 
implement the algorithms, so that we can change the underlying implementation 
in the future without breaking the interfaces. We need the following:
    
    * data models for sparse vectors. We need data models for dense vectors, 
sequential access sparse vectors (backed by two parallel arrays), and random 
access sparse vectors (backed by a primitive-typed hash map; not in this pull 
request). Those are defined in the Vec class in this PR.
    * a linear algebra package. We are considering either breeze or 
mahout-math. Both have pros and cons, and we can discuss more in the JIRA. This 
PR uses mahout-math. Mahout vectors do not implement Serializable, so we need a 
serializable wrapper class (MahoutVectorWrapper) to use them in Spark. As a 
result, we added not only mahout-math but also mahout-core to the dependencies, 
because the wrapper class needs VectorWritable, which is defined in 
mahout-core. But we can certainly remove most transitive dependencies of 
mahout-core.
    * lightweight converters. The conversion between our data models and Mahout 
vectors shouldn't involve data copying. However, Mahout vectors hide their 
members. In this PR, Java reflection is used to get the private fields out. 
This doesn't seem to be a good solution, but I have not found a better one.
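    
    To make the "two parallel arrays" layout of the sequential access sparse 
vector concrete, here is a minimal sketch in Java. It is not the Vec class from 
this PR: the class name SparseVectorSketch and all of its methods are 
hypothetical, chosen only for illustration. The squared-distance expansion 
shown is a common trick for KMeans-style computations and is an assumption 
here, not necessarily what the PR implements.

```java
/**
 * Hypothetical sketch of a sequential access sparse vector backed by two
 * parallel arrays. Names are illustrative and not taken from the PR.
 */
final class SparseVectorSketch {
    final int size;        // logical dimension of the vector
    final int[] indices;   // strictly increasing positions of the stored entries
    final double[] values; // values[k] is the entry at position indices[k]

    SparseVectorSketch(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        for (int k = 0; k < indices.length; k++) {
            boolean inRange = indices[k] >= 0 && indices[k] < size;
            boolean increasing = k == 0 || indices[k] > indices[k - 1];
            if (!inRange || !increasing) {
                throw new IllegalArgumentException(
                    "indices must be strictly increasing and within [0, size)");
            }
        }
        // Keep references to the caller's arrays: no data copying.
        this.size = size;
        this.indices = indices;
        this.values = values;
    }

    /** Dot product with a dense vector; touches only the stored entries. */
    double dot(double[] dense) {
        if (dense.length != size) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }

    /** Squared Euclidean norm, computed from the stored entries only. */
    double squaredNorm() {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    /**
     * Squared distance to a dense center via ||x||^2 - 2 x.c + ||c||^2.
     * The caller supplies centerSquaredNorm = ||c||^2, precomputed once.
     */
    double squaredDistance(double[] center, double centerSquaredNorm) {
        return squaredNorm() - 2.0 * dot(center) + centerSquaredNorm;
    }

    public static void main(String[] args) {
        // The vector (0, 2, 0, 4, 0) stored as two parallel arrays.
        SparseVectorSketch x =
            new SparseVectorSketch(5, new int[] {1, 3}, new double[] {2.0, 4.0});
        double[] center = {1.0, 1.0, 1.0, 1.0, 1.0};
        System.out.println(x.dot(center));                  // prints 6.0
        System.out.println(x.squaredDistance(center, 5.0)); // prints 13.0
    }
}
```

    With this layout the cost of a distance computation is proportional to the 
number of stored entries rather than to the full dimension. Note that the 
expansion can lose precision through cancellation when a point lies very close 
to a center.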

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-spark sparse

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/575.patch

----
commit 37c423e746e26d9c3db23580df34a959ddf8fe44
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:33:42Z

    add mahout-math and mahout-core to mllib

commit 6d3fda1b07683824719330243cb07275d5537f76
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:44:13Z

    add MahoutVectorWrapper

commit ff2c072b712479f2f664a4f3828b85f46c46128e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T08:47:35Z

    add implicit conversions

commit ee53e096bb101571a559df1d78c8091bb3ba4b0a
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T09:38:12Z

    add Vec and MahoutVectorHelper

commit 92be705a9c317308554d23105e6fc747764e6568
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:04:01Z

    update LocalKMeans

commit d56239954fc41482a86993896ca9229c6dbb0756
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:15:02Z

    use mahout in KMeans

commit 38f0cc6058c88628827c1513845de04eed5da69e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:41:58Z

    add Vec interface to KMeansModel
    add a sparse test to KMeansSuite

commit 686fb79872421ff5cbc0e083051fae78c79186aa
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:49:20Z

    add headers and docs

commit 273af590eb8c9b720cc5dffa1b1e03447c93362e
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T10:55:04Z

    add VecSuite

commit b7e06c86f65830d8704977960698e13bc4d06070
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T11:00:40Z

    remove default constructor from MahoutVectorWrapper

----
