[ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121428#comment-15121428 ]
Roy Levin commented on SPARK-12861:
-----------------------------------

Please note that I also explain the difference and why the proposed solution to SPARK-4039 (reducing the dimensions) cannot work in the case I describe, so I do not think this is a duplicate. In any case, if the issue is closed, what does that mean with respect to the code changes I implemented?

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: SPARK-12861
>                 URL: https://issues.apache.org/jira/browse/SPARK-12861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example I have a feature space of around 50,000
> dimensions and k ~= 500, which amounts to roughly 200 MB of RAM for the
> center vectors alone (500 centers x 50,000 doubles x 8 bytes), even though
> the center vectors are in fact very sparse and require far less RAM.
> Since I am running on a system with relatively low resources I keep
> getting OutOfMemory errors. In my setting it is acceptable to trade runtime
> for lower RAM usage. This is what I set out to do in my solution, while
> leaving users the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but
> this is not always the best approach. Consider, for example, an object space
> of users and a feature space of items, where we want to run k-means over
> features that are a function of how many times user i clicked item j. If we
> reduce the dimensions of the items, we can no longer map the center vectors
> back to the items. Moreover, in a streaming context, detecting changes with
> respect to previous runs becomes more difficult.
> My solution:
> ----------------
> Allow the KMeans algorithm to accept a VectorFactory, which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (as the code does now), but users can supply a SmartVectorFactory
> (or some proprietary VectorFactory) that chooses to make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVector, allowing its indices
> and values to be changed in place
> (2) Allowed axpy to accept SparseVectors (see the second sketch below)
> (3) Created a trait called VectorFactory and two implementations of it
> that are used within the KMeans code (see the first sketch below)
> To get the above-described solution do the following:
> git clone https://github.com/levin-royl/spark.git -b
> SupportLargeFeatureDomains
> Note
> ------
> Some similar issues were opened in JIRA in the past, e.g.:
> https://issues.apache.org/jira/browse/SPARK-4039
> https://issues.apache.org/jira/browse/SPARK-1212
> https://github.com/mesos/spark/pull/736
> The difference is that in the problem I describe, reducing the dimensions
> of the problem (i.e., the feature space) to allow using dense vectors is not
> suitable. Also, the solution I implemented supports this case while giving
> the user full flexibility: keep the default dense vector implementation, or
> select an alternative only when the default is not desired.
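For readers who do not want to pull the branch, here is a minimal sketch of what the VectorFactory abstraction from change (3) could look like. Only the names VectorFactory and SmartVectorFactory come from the description above; all signatures, the DenseVectorFactory object, and the density threshold are illustrative assumptions and may differ from the actual branch code:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical factory trait: decides how KMeans materializes its
    // internal vectors (e.g., cluster centers) from index/value pairs.
    // Assumes indices are sorted ascending, as MLlib's SparseVector requires.
    trait VectorFactory extends Serializable {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector
    }

    // Default behavior: always build dense vectors, matching what the
    // current KMeans implementation does.
    object DenseVectorFactory extends VectorFactory {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
        val arr = new Array[Double](size)
        var k = 0
        while (k < indices.length) {
          arr(indices(k)) = values(k)
          k += 1
        }
        Vectors.dense(arr)
      }
    }

    // "Smart" factory: keeps a vector sparse while its density stays below
    // a threshold, falling back to dense storage otherwise.
    class SmartVectorFactory(densityThreshold: Double = 0.3) extends VectorFactory {
      def create(size: Int, indices: Array[Int], values: Array[Double]): Vector =
        if (indices.length.toDouble / size < densityThreshold) {
          Vectors.sparse(size, indices, values)
        } else {
          DenseVectorFactory.create(size, indices, values)
        }
    }

With k ~= 500 centers over 50,000 features, a center held sparse costs memory proportional to its non-zero count instead of a fixed 50,000 x 8 bytes = 400 KB, which is where the savings over the always-dense behavior come from.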
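Change (2) is less obvious, because MLlib's BLAS axpy (y := a*x + y) currently accumulates into a dense y. Below is a hedged sketch of one plausible reading, the sparse-into-sparse case; the real branch presumably mutates y in place through the new reassign method from change (1), whereas this standalone version returns a fresh vector so it compiles against the public API as-is:

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

    // Sketch of axpy where both x and y are sparse: the result's active set
    // is the union of the two index sets, so the cost is O(nnz(x) + nnz(y))
    // rather than O(size).
    def sparseAxpy(a: Double, x: SparseVector, y: SparseVector): Vector = {
      require(x.size == y.size, "vector sizes must match")
      val acc = mutable.HashMap.empty[Int, Double]
      var k = 0
      while (k < y.indices.length) {
        acc(y.indices(k)) = y.values(k)
        k += 1
      }
      k = 0
      while (k < x.indices.length) {
        val i = x.indices(k)
        acc(i) = acc.getOrElse(i, 0.0) + a * x.values(k)
        k += 1
      }
      // Vectors.sparse sorts the (index, value) pairs by index internally.
      Vectors.sparse(y.size, acc.toSeq)
    }

Because the merged index set generally differs from y's original one, an in-place version has to replace y's indices and values arrays wholesale, which is exactly what a reassign(indices, values) mutator from change (1) would provide.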