[ https://issues.apache.org/jira/browse/KYLIN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shaofeng SHI closed KYLIN-1326.
-------------------------------

> Changes to support KMeans with large feature space
> --------------------------------------------------
>
>                 Key: KYLIN-1326
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1326
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Roy Levin
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example I have a feature space of around 50000
> dimensions and k ~= 500. This adds up to around 200MB of RAM for the center
> vectors alone, while in fact the center vectors are very sparse and require
> a lot less RAM.
> Since I am running on a system with relatively low resources I keep getting
> OutOfMemory errors. In my setting it is OK to trade runtime for lower RAM
> usage. This is what I set out to do in my solution, while giving users the
> flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but
> this is not always the best approach. For example, consider the case where
> the object space consists of users and the feature space of items. There we
> may want to run KMeans over a feature space in which each feature is a
> function of how many times user i clicked item j. If we reduce the
> dimensions of the items we will not be able to map the center vectors back
> to the items. Moreover, in a streaming context, detecting changes with
> respect to previous runs becomes more difficult.
>
> My solution:
> ----------------
> Allow the KMeans algorithm to accept a VectorFactory which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (as is the case now). The user can now, however, provide a
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to
> make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVectors that allows changing
> the indices and values
> (2) Allowed axpy to accept SparseVectors
> (3) Created a trait called VectorFactory and two implementations of it that
> are used within the KMeans code
>
> To get the solution described above, do the following:
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
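
For readers who want a concrete picture of the proposal, here is a minimal Python sketch of the idea. It is not the code from the branch above, which is a Scala change to Spark. Only the names VectorFactory, SmartVectorFactory, reassign and axpy come from the issue text. Everything else is a hypothetical illustration: the DenseVec and SparseVec classes, DenseVectorFactory, the max_density threshold, dense_center_bytes, and the assumption of 8 bytes per stored value.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Union


def dense_center_bytes(dim: int, k: int, bytes_per_value: int = 8) -> int:
    """Memory for k dense centers, assuming 8-byte doubles (an assumption)."""
    return dim * k * bytes_per_value


@dataclass
class DenseVec:
    values: List[float]            # one slot per feature, zeros included


@dataclass
class SparseVec:
    size: int                      # dimension of the feature space
    entries: Dict[int, float]      # only the non-zero features are stored

    def reassign(self, entries: Dict[int, float]) -> None:
        """Replace the stored indices and values (cf. `reassign` in the issue)."""
        self.entries = dict(entries)


Vec = Union[DenseVec, SparseVec]


class VectorFactory(Protocol):
    """Decides whether a vector used inside the algorithm is dense or sparse."""
    def make(self, size: int, entries: Dict[int, float]) -> Vec: ...


class DenseVectorFactory:
    """Hypothetical default: always dense, like the existing behavior."""
    def make(self, size: int, entries: Dict[int, float]) -> Vec:
        values = [0.0] * size
        for i, v in entries.items():
            values[i] = v
        return DenseVec(values)


class SmartVectorFactory:
    """Hypothetical policy: go sparse when few entries are non-zero."""
    def __init__(self, max_density: float = 0.25) -> None:
        self.max_density = max_density

    def make(self, size: int, entries: Dict[int, float]) -> Vec:
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if size > 0 and len(nonzero) / size <= self.max_density:
            return SparseVec(size, nonzero)
        return DenseVectorFactory().make(size, nonzero)


def axpy(a: float, x: SparseVec, y: Vec) -> None:
    """y += a * x in place, touching only the non-zero entries of x."""
    if isinstance(y, DenseVec):
        for i, v in x.entries.items():
            y.values[i] += a * v
    else:
        for i, v in x.entries.items():
            y.entries[i] = y.entries.get(i, 0.0) + a * v


# Usage: accumulate two sparse points into a center that stays sparse.
center = SmartVectorFactory().make(50000, {})
axpy(1.0, SparseVec(50000, {3: 2.0, 41: 1.0}), center)
axpy(0.5, SparseVec(50000, {3: 4.0}), center)
```

With the figures from the issue, dense_center_bytes(50000, 500) gives 200,000,000 bytes, which matches the roughly 200MB quoted above. The sparse center in the usage example stores only two entries. The price is slower per-element access, which is the runtime-for-RAM trade-off the reporter describes.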