[ https://issues.apache.org/jira/browse/SPARK-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12861:
------------------------------
    Affects Version/s:     (was: 1.6.1)
                       1.6.0
     Target Version/s:     (was: 1.6.1)
        Fix Version/s:     (was: 1.6.1)
          Component/s: ML

[~levin.r...@gmail.com] don't set Target Version (that's for committers) or Fix Version (the issue isn't resolved)

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: SPARK-12861
>                 URL: https://issues.apache.org/jira/browse/SPARK-12861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example I have a feature space of around 50000
> dimensions and k ~= 500. This adds up to around 200MB of RAM for the center
> vectors alone (50000 x 500 doubles at 8 bytes each), while in fact the
> center vectors are very sparse and require a lot less RAM.
> Since I am running on a system with relatively low resources I keep getting
> OutOfMemory errors. In my setting it is OK to trade off runtime for using
> less RAM. This is what I set out to do in my solution, while allowing users
> the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but
> this is not always the best approach. For example, suppose the object space
> is comprised of users and the feature space of items. In such a case we may
> want to run KMeans over a feature space which is a function of how many
> times user i clicked item j. If we reduce the dimensions of the items we
> will not be able to map the center vectors back to the items. Moreover, in
> a streaming context, detecting changes with respect to previous runs
> becomes more difficult.
>
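
The 200MB figure quoted above can be checked with a rough calculation. The sketch below is only an estimate: it assumes 8-byte doubles for values and 4-byte ints for sparse indices, ignores JVM object and array headers, and uses a made-up number of non-zeros per center (`nnz_per_center`), since the report does not say how sparse the centers actually are.

```python
# Rough memory estimate for k-means centers, dense vs. sparse storage.
# Assumptions: 8-byte doubles for values, 4-byte ints for sparse indices,
# object and array headers ignored.

def dense_bytes(num_features: int, k: int) -> int:
    """Bytes needed to hold k dense centers of num_features doubles each."""
    return num_features * k * 8


def sparse_bytes(nnz_per_center: int, k: int) -> int:
    """Bytes needed to hold k sparse centers with nnz_per_center non-zeros each."""
    return nnz_per_center * k * (8 + 4)


num_features, k = 50_000, 500   # figures from the report
nnz_per_center = 1_000          # hypothetical sparsity, NOT from the report

dense_mb = dense_bytes(num_features, k) / 1e6
sparse_mb = sparse_bytes(nnz_per_center, k) / 1e6
print(f"dense:  {dense_mb:.0f} MB")
print(f"sparse: {sparse_mb:.0f} MB")
```

With these assumptions the dense layout comes to 200 MB, matching the report; the sparse figure depends entirely on the assumed number of non-zeros.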
> My solution:
> ----------------
> Allow the KMeans algorithm to accept a VectorFactory which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (as is the case now). But now the user can provide a
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to
> make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVectors that allows changing
> the indices and values
> (2) Allowed axpy to accept SparseVectors
> (3) Created a trait called VectorFactory and two implementations of it
> that are used within the KMeans code
> To get the solution described above, do the following:
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
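
To make the idea above concrete, here is a minimal sketch of how such a factory might look. It is not the code from the linked branch: the names `VectorFactory` and `SmartVectorFactory` come from the description, but the method `make`, the `max_density` threshold, and the `DenseVec`/`SparseVec` classes are hypothetical and chosen only for illustration. It is written in plain Python rather than Scala to keep it short and free of Spark dependencies.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class DenseVec:
    values: List[float]

    @property
    def size(self) -> int:
        return len(self.values)


@dataclass
class SparseVec:
    size: int
    entries: Dict[int, float]  # index -> non-zero value


Vec = Union[DenseVec, SparseVec]


class VectorFactory(ABC):
    """Decides how a vector used inside the algorithm is stored."""

    @abstractmethod
    def make(self, size: int, entries: Dict[int, float]) -> Vec: ...


class DenseVectorFactory(VectorFactory):
    """Backward-compatible default: always dense."""

    def make(self, size: int, entries: Dict[int, float]) -> Vec:
        values = [0.0] * size
        for i, v in entries.items():
            values[i] = v
        return DenseVec(values)


class SmartVectorFactory(VectorFactory):
    """Sparse when the fraction of non-zeros is below max_density (hypothetical rule)."""

    def __init__(self, max_density: float = 0.25):
        self.max_density = max_density
        self._dense = DenseVectorFactory()

    def make(self, size: int, entries: Dict[int, float]) -> Vec:
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if size > 0 and len(nonzero) / size < self.max_density:
            return SparseVec(size, nonzero)
        return self._dense.make(size, nonzero)


def axpy(a: float, x: SparseVec, y: Vec) -> None:
    """In-place y += a * x, where the accumulator y may be dense or sparse."""
    if isinstance(y, DenseVec):
        for i, v in x.entries.items():
            y.values[i] += a * v
    else:
        for i, v in x.entries.items():
            y.entries[i] = y.entries.get(i, 0.0) + a * v


# Usage: a mostly-empty center stays sparse while points are accumulated into it.
factory = SmartVectorFactory(max_density=0.25)
center = factory.make(10, {2: 1.0})
point = SparseVec(10, {2: 3.0, 7: 4.0})
axpy(0.5, point, center)
print(type(center).__name__, center.entries)
```

The trade-off mentioned in the description shows up in `axpy`: updating a sparse accumulator goes through a map lookup per entry instead of a direct array write, which costs runtime but keeps memory proportional to the number of non-zeros.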