[ 
https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146464#comment-15146464
 ] 

yuhao yang commented on SPARK-12861:
------------------------------------

https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter, so users can choose 
which one to use. As expected, it can save a lot of memory, depending on the 
average sparsity of the cluster centers, but it also takes considerably longer.

For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory 
consumption by a factor of 40 but used 700% of the time. Feel free to use it 
if you really need k-means over a large feature dimension.
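
To make the memory trade-off concrete, here is a rough back-of-the-envelope 
sketch in Scala. It uses the k and feature-space size from the issue quoted 
below; the object name and the average number of nonzeros per center (avgNnz) 
are made up for illustration, and the byte counts ignore JVM object overhead.

```scala
object CenterMemoryEstimate {
  // Bytes for k dense centers of dimension d: one 8-byte Double per entry.
  def denseBytes(k: Long, d: Long): Long = k * d * 8L

  // Bytes for k sparse centers with avgNnz stored entries each:
  // one 8-byte Double value plus one 4-byte Int index per entry.
  def sparseBytes(k: Long, avgNnz: Long): Long = k * avgNnz * (8L + 4L)

  def main(args: Array[String]): Unit = {
    val k = 500L       // number of centers, from the quoted issue
    val d = 50000L     // feature-space size, from the quoted issue
    val avgNnz = 1000L // hypothetical average nonzeros per center

    println(s"dense:  ${denseBytes(k, d)} bytes")       // 200000000 (~200 MB)
    println(s"sparse: ${sparseBytes(k, avgNnz)} bytes") // 6000000 (~6 MB)
  }
}
```

The dense figure matches the roughly 200MB quoted in the issue; how much the 
sparse representation saves depends entirely on how sparse the centers stay.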

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: SPARK-12861
>                 URL: https://issues.apache.org/jira/browse/SPARK-12861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>
>     The problem:
>     -----------------
>     In Spark's KMeans code the center vectors are always represented as dense 
> vectors. As a result, when each such center has a large domain space the 
> algorithm quickly runs out of memory. In my example I have a feature space of 
> around 50000 and k ~= 500. This sums up to around 200MB RAM for the center 
> vectors alone while in fact the center vectors are very sparse and require a 
> lot less RAM.
>     Since I am running on a system with relatively low resources I keep 
> getting OutOfMemory errors. In my setting it is OK to trade off runtime for 
> using less RAM. This is what I set out to do in my solution while allowing 
> users the flexibility to choose.
>     One solution could be to reduce the dimensions of the feature space but 
> this is not always the best approach, for example when the object space 
> consists of users and the feature space of items. In such an example we may 
> want to run kmeans over a feature space which is a function of how many times 
> user i clicked item j. If we reduce the dimensions of the items we will not 
> be able to map the center vectors back to the items. Moreover, in a streaming 
> context detecting the changes with respect to previous runs gets more 
> difficult.
>     My solution:
>     ----------------
>     Allow the kmeans algorithm to accept a VectorFactory which decides when 
> vectors used inside the algorithm should be sparse and when they should be 
> dense. For backward compatibility the default behavior is to always make them 
> dense (as is the case now). But now the user can provide a 
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to 
> make vectors sparse.
>     For this I made the following changes:
>     (1) Added a method called reassign to SparseVectors, which allows 
> changing the indices and values
>     (2) Allowed axpy to accept SparseVectors
>     (3) Created a trait called VectorFactory and two implementations of it 
> that are used within the KMeans code
>     To get the above described solution do the following:
>     git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
> Note
> ------
> There are some similar issues opened in JIRA in the past, e.g.:
> https://issues.apache.org/jira/browse/SPARK-4039
> https://issues.apache.org/jira/browse/SPARK-1212
> https://github.com/mesos/spark/pull/736
> But the difference is that in the problem I describe, reducing the dimensions 
> of the problem (i.e., the feature space) to allow using dense vectors is not 
> suitable. Also, the solution I implemented supports this while allowing full 
> flexibility to the user, i.e., using the default dense vector 
> implementation or selecting an alternative (only when the default is not 
> desired).
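
The factory idea and the sparse axpy described in the issue can be sketched 
roughly as follows. This is a self-contained illustration only, not the code 
in either linked branch: the Vec, DenseVec and SparseVec types are stand-ins 
for Spark's vector classes, and the fromArray method, the maxDensity 
threshold and the axpySparse helper are all hypothetical names. Only the 
names VectorFactory and SmartVectorFactory come from the issue text.

```scala
import scala.collection.mutable.ArrayBuffer

object VectorFactorySketch {
  // Stand-in vector types (assumption: not Spark's own classes).
  sealed trait Vec { def size: Int }
  final case class DenseVec(values: Array[Double]) extends Vec {
    def size: Int = values.length
  }
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])
      extends Vec

  // Decides which representation to use for a vector built inside the algorithm.
  trait VectorFactory {
    def fromArray(values: Array[Double]): Vec
  }

  // Default behavior: always dense.
  object DenseVectorFactory extends VectorFactory {
    def fromArray(values: Array[Double]): Vec = DenseVec(values)
  }

  // Picks sparse when the fraction of nonzeros is at most maxDensity
  // (hypothetical rule; the real SmartVectorFactory may decide differently).
  final class SmartVectorFactory(maxDensity: Double) extends VectorFactory {
    def fromArray(values: Array[Double]): Vec = {
      val nz = values.indices.filter(i => values(i) != 0.0).toArray
      if (values.nonEmpty && nz.length.toDouble / values.length <= maxDensity)
        SparseVec(values.length, nz, nz.map(i => values(i)))
      else DenseVec(values)
    }
  }

  // Computes a * x + y for two sparse vectors by merging their sorted index
  // arrays. The nonzero pattern of the result can differ from y's, so a new
  // vector is returned instead of updating y in place.
  def axpySparse(a: Double, x: SparseVec, y: SparseVec): SparseVec = {
    require(x.size == y.size, "dimension mismatch")
    val idx = ArrayBuffer.empty[Int]
    val vals = ArrayBuffer.empty[Double]
    var i = 0
    var j = 0
    while (i < x.indices.length || j < y.indices.length) {
      if (j >= y.indices.length ||
          (i < x.indices.length && x.indices(i) < y.indices(j))) {
        idx += x.indices(i); vals += a * x.values(i); i += 1   // only in x
      } else if (i >= x.indices.length || y.indices(j) < x.indices(i)) {
        idx += y.indices(j); vals += y.values(j); j += 1       // only in y
      } else {
        idx += x.indices(i); vals += a * x.values(i) + y.values(j) // in both
        i += 1; j += 1
      }
    }
    SparseVec(x.size, idx.toArray, vals.toArray)
  }
}
```

With a sketch like this, passing DenseVectorFactory keeps today's behavior, 
while new SmartVectorFactory(0.5) would return a SparseVec for an array such 
as Array(0.0, 0.0, 0.0, 2.0) and a DenseVec for a mostly nonzero one.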


