[ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121428#comment-15121428 ]
Roy Levin commented on SPARK-12861:
-----------------------------------

Please note that I also explain the difference and why the proposed solution to SPARK-4039 (reducing the dimensions) cannot work in the case I describe, so I do not think this is a duplicate. In any case, if the issue is closed, what does that mean with respect to the code changes I implemented?

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: SPARK-12861
>                 URL: https://issues.apache.org/jira/browse/SPARK-12861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example I have a feature space of around 50,000
> dimensions and k ~= 500, which amounts to roughly 200 MB of RAM for the
> center vectors alone (500 centers x 50,000 doubles x 8 bytes), even though
> the center vectors are in fact very sparse and require far less RAM.
> Since I am running on a system with relatively low resources I keep
> getting OutOfMemory errors. In my setting it is acceptable to trade runtime
> for lower RAM usage. This is what I set out to do in my solution, while
> leaving users the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but
> this is not always the best approach. Consider, for example, an object space
> of users and a feature space of items, where we want to run k-means over
> features that are a function of how many times user i clicked item j. If we
> reduce the dimensions of the items, we can no longer map the center vectors
> back to the items. Moreover, in a streaming context, detecting changes with
> respect to previous runs becomes more difficult.
> My solution:
> ----------------
> Allow the KMeans algorithm to accept a VectorFactory, which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (as the code does now), but users can supply a SmartVectorFactory
> (or some proprietary VectorFactory) that chooses to make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVector, allowing its indices
> and values to be changed in place
> (2) Allowed axpy to accept SparseVectors (see the second sketch below)
> (3) Created a trait called VectorFactory and two implementations of it
> that are used within the KMeans code (see the first sketch below)
> To get the above-described solution do the following:
> git clone https://github.com/levin-royl/spark.git -b
> SupportLargeFeatureDomains
> Note
> ------
> Some similar issues were opened in JIRA in the past, e.g.:
> https://issues.apache.org/jira/browse/SPARK-4039
> https://issues.apache.org/jira/browse/SPARK-1212
> https://github.com/mesos/spark/pull/736
> The difference is that in the problem I describe, reducing the dimensions
> of the problem (i.e., the feature space) to allow using dense vectors is not
> suitable. Also, the solution I implemented supports this case while giving
> the user full flexibility: keep the default dense vector implementation, or
> select an alternative only when the default is not desired.
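For readers who do not want to pull the branch, here is a minimal sketch of what the VectorFactory abstraction from change (3) could look like. Only the names VectorFactory and SmartVectorFactory come from the description above; all signatures, the DenseVectorFactory object, and the density threshold are illustrative assumptions and may differ from the actual branch code:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical factory trait: decides how KMeans materializes its
    // internal vectors (e.g., cluster centers) from index/value pairs.
    // Assumes indices are sorted ascending, as MLlib's SparseVector requires.
    trait VectorFactory extends Serializable {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector
    }

    // Default behavior: always build dense vectors, matching what the
    // current KMeans implementation does.
    object DenseVectorFactory extends VectorFactory {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
        val arr = new Array[Double](size)
        var k = 0
        while (k < indices.length) {
          arr(indices(k)) = values(k)
          k += 1
        }
        Vectors.dense(arr)
      }
    }

    // "Smart" factory: keeps a vector sparse while its density stays below
    // a threshold, falling back to dense storage otherwise.
    class SmartVectorFactory(densityThreshold: Double = 0.3) extends VectorFactory {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector =
        if (indices.length.toDouble / size < densityThreshold) {
          Vectors.sparse(size, indices, values)
        } else {
          DenseVectorFactory.create(size, indices, values)
        }
    }

With k ~= 500 centers over 50,000 features, a center held sparse costs memory proportional to its non-zero count instead of a fixed 50,000 x 8 bytes = 400 KB, which is where the savings over the always-dense behavior come from.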
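Change (2) is less obvious, because MLlib's BLAS axpy (y := a*x + y) currently accumulates into a dense y. Below is a hedged sketch of one plausible reading, the sparse-into-sparse case; the real branch presumably mutates y in place through the new reassign method from change (1), whereas this standalone version returns a fresh vector so it compiles against the public API as-is:

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

    // Sketch of axpy where both x and y are sparse: the result's active set
    // is the union of the two index sets, so the cost is O(nnz(x) + nnz(y))
    // rather than O(size).
    def sparseAxpy(a: Double, x: SparseVector, y: SparseVector): Vector = {
      require(x.size == y.size, "vector sizes must match")
      val acc = mutable.HashMap.empty[Int, Double]
      var k = 0
      while (k < y.indices.length) {
        acc(y.indices(k)) = y.values(k)
        k += 1
      }
      k = 0
      while (k < x.indices.length) {
        val i = x.indices(k)
        acc(i) = acc.getOrElse(i, 0.0) + a * x.values(k)
        k += 1
      }
      // Vectors.sparse sorts the (index, value) pairs by index internally.
      Vectors.sparse(y.size, acc.toSeq)
    }

Because the merged index set generally differs from y's original one, an in-place version has to replace y's indices and values arrays wholesale, which is exactly what a reassign(indices, values) mutator from change (1) would provide.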