[jira] [Updated] (SPARK-12861) Changes to support KMeans with large feature space

Sean Owen (JIRA) Sun, 17 Jan 2016 00:26:19 -0800

     [ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sean Owen updated SPARK-12861:
------------------------------
    Affects Version/s:     (was: 1.6.1)
                       1.6.0
     Target Version/s:   (was: 1.6.1)
        Fix Version/s:     (was: 1.6.1)
          Component/s: ML

[~levin.r...@gmail.com] please don't set Target Version (that is for 
committers) or Fix Version (the issue is not resolved).

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: SPARK-12861
>                 URL: https://issues.apache.org/jira/browse/SPARK-12861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>
>     The problem:
>     -----------------
>     In Spark's KMeans code the center vectors are always represented as dense 
> vectors. As a result, when the feature space is large the algorithm quickly 
> runs out of memory. In my example I have a feature space of around 50000 
> dimensions and k ~= 500. Stored as dense vectors of 8-byte doubles, the 
> center vectors alone take around 200MB of RAM (50000 x 500 x 8 bytes), while 
> in fact the center vectors are very sparse and require a lot less RAM.
>     Since I am running on a system with relatively low resources I keep 
> getting OutOfMemory errors. In my setting it is OK to trade off runtime for 
> using less RAM. This is what I set out to do in my solution while allowing 
> users the flexibility to choose.
>     One solution could be to reduce the dimensions of the feature space, but 
> this is not always the best approach. For example, suppose the objects are 
> users and the features are items, and we want to run kmeans over a feature 
> space where feature j of user i is a function of how many times user i 
> clicked item j. If we reduce the dimensions of the items we will not be able 
> to map the center vectors back to the items. Moreover, in a streaming 
> context, detecting changes with respect to previous runs gets more difficult.
>     My solution:
>     ----------------
>     Allow the kmeans algorithm to accept a VectorFactory which decides when 
> vectors used inside the algorithm should be sparse and when they should be 
> dense. For backward compatibility the default behavior is to always make them 
> dense (as is the case now). However, the user can now provide a 
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to 
> make vectors sparse.
>     For this I made the following changes:
>     (1) Added a method called reassign to SparseVectors that allows changing 
> the indices and values
>     (2) Allowed axpy to accept SparseVectors
>     (3) Created a trait called VectorFactory and two implementations of it 
> that are used within the KMeans code
>     To get the solution described above, run:
>     git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
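
The ~200MB figure in the description above can be checked with a quick back-of-the-envelope calculation. The sketch below is plain Python, not Spark code. It assumes 8 bytes per dense entry (a double) and 12 bytes per stored sparse entry (a double value plus a 32-bit index), and it ignores JVM object overhead. The 1% density is a made-up illustrative value; the report only says the centers are "very sparse".

```python
DOUBLE_BYTES = 8  # one 64-bit float per dense entry
INT_BYTES = 4     # one 32-bit index per stored sparse entry


def dense_bytes(k, dim):
    """Bytes needed to hold k dense centers of dimension dim."""
    return k * dim * DOUBLE_BYTES


def sparse_bytes(k, dim, density):
    """Bytes needed to hold k sparse centers; density is the nonzero fraction."""
    nnz_per_center = round(dim * density)
    return k * nnz_per_center * (DOUBLE_BYTES + INT_BYTES)


k, dim = 500, 50_000  # values from the issue description
print(dense_bytes(k, dim) / 1e6)         # 200.0 (MB, dense)
print(sparse_bytes(k, dim, 0.01) / 1e6)  # 3.0 (MB, at an assumed 1% density)
```

Under these assumptions a sparse layout only pays off while fewer than about two thirds of the entries are nonzero, since each stored entry costs 12 bytes instead of 8.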

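To illustrate the idea of a pluggable vector factory, here is a minimal sketch in plain Python rather than Scala. It is not the code from the linked branch. The names VectorFactory and SmartVectorFactory come from the description, but the `make` method, the `max_density` threshold, the dict-based sparse representation and this `axpy` are all assumptions made for illustration.

```python
class DenseVectorFactory:
    """Mirrors the default behaviour described above: always build dense vectors."""

    def make(self, size, entries):
        vec = [0.0] * size
        for i, x in entries.items():
            vec[i] = x
        return vec


class SmartVectorFactory:
    """Hypothetical policy: go sparse when few enough entries are nonzero."""

    def __init__(self, max_density=0.5):
        self.max_density = max_density  # assumed threshold, not from the patch

    def make(self, size, entries):
        nonzero = {i: x for i, x in entries.items() if x != 0.0}
        if len(nonzero) <= self.max_density * size:
            return nonzero  # sparse: index -> value
        return DenseVectorFactory().make(size, nonzero)


def axpy(a, x, y):
    """In-place y += a * x, where x is sparse and y is dense (list) or sparse (dict)."""
    for i, xi in x.items():
        if isinstance(y, dict):
            y[i] = y.get(i, 0.0) + a * xi
        else:
            y[i] += a * xi


factory = SmartVectorFactory(max_density=0.5)
center = factory.make(10, {2: 1.0})    # 1 of 10 entries set, so it stays sparse
axpy(2.0, {2: 1.0, 5: 3.0}, center)    # accumulate a point into the center
print(center)                          # {2: 3.0, 5: 6.0}
```

The point of the design is that the KMeans loop only talks to the factory and to axpy, so switching between dense and sparse centers becomes a policy choice made by the caller rather than a change to the algorithm.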


