Roy Levin created KYLIN-1326:
--------------------------------

             Summary: Changes to support KMeans with large feature space
                 Key: KYLIN-1326
                 URL: https://issues.apache.org/jira/browse/KYLIN-1326
             Project: Kylin
          Issue Type: Improvement
          Components: Spark
            Reporter: Roy Levin


The problem:
-----------------
In Spark's KMeans code the center vectors are always represented as dense 
vectors. As a result, when the feature space is large the algorithm quickly 
runs out of memory. In my example I have a feature space of around 50,000 
dimensions and k ~= 500. Stored as dense arrays of 8-byte doubles, this comes 
to around 200MB of RAM for the center vectors alone, while in fact the center 
vectors are very sparse and require far less RAM.
Since I am running on a system with relatively low resources, I keep getting 
OutOfMemory errors. In my setting it is OK to trade runtime for lower memory 
use. This is what I set out to do in my solution, while leaving users the 
flexibility to choose.
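
As a rough check of the figures above, the following Python sketch computes 
the dense footprint and compares it with a sparse layout. The number of 
nonzeros per center (assumed_nnz) and the 4-byte index size are assumptions 
made for illustration only, and per-object overhead is ignored.

```python
# Back-of-the-envelope memory estimate for the k-means center vectors.
BYTES_PER_DOUBLE = 8  # one 64-bit floating point value
BYTES_PER_INDEX = 4   # assumed 32-bit index per stored nonzero


def dense_bytes(k, dim):
    """Bytes needed when every center stores all `dim` values."""
    return k * dim * BYTES_PER_DOUBLE


def sparse_bytes(k, nnz_per_center):
    """Bytes needed when every center stores only (index, value) pairs."""
    return k * nnz_per_center * (BYTES_PER_DOUBLE + BYTES_PER_INDEX)


k, dim = 500, 50_000
assumed_nnz = 1_000  # hypothetical: nonzeros per center, not from the report

print(dense_bytes(k, dim) / 1e6, "MB dense")
print(sparse_bytes(k, assumed_nnz) / 1e6, "MB sparse (under the assumption)")
```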

One solution could be to reduce the dimensionality of the feature space, but 
this is not always the best approach. Consider, for example, a case where the 
objects are users and the features are items, and we want to run k-means over 
a feature space in which entry (i, j) is a function of how many times user i 
clicked item j. If we reduce the dimensions of the items, we will not be able 
to map the center vectors back to the items. Moreover, in a streaming context, 
detecting changes with respect to previous runs becomes more difficult.
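
To make the user/item example concrete, here is a small Python sketch that 
builds one sparse click-count vector per user. The function name 
click_vectors and the sample data are hypothetical. Because each key is the 
original item index j, a center computed over these vectors can still be read 
back in terms of items.

```python
from collections import Counter, defaultdict


def click_vectors(clicks):
    """Map each user to {item index: click count}, keeping only nonzeros."""
    counts = defaultdict(Counter)
    for user, item in clicks:
        counts[user][item] += 1
    return {user: dict(per_item) for user, per_item in counts.items()}


# Hypothetical click log of (user id, item index) pairs.
clicks = [("u1", 7), ("u1", 7), ("u1", 42), ("u2", 42)]
vectors = click_vectors(clicks)
print(vectors)
```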


My solution:
----------------
Allow the KMeans algorithm to accept a VectorFactory, which decides when 
vectors used inside the algorithm should be sparse and when they should be 
dense. For backward compatibility, the default behavior is to always make them 
dense (as is the case now). The user can, however, provide a 
SmartVectorFactory (or some proprietary VectorFactory) that can decide to make 
vectors sparse.
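
The idea can be sketched in Python as follows. This is not the code from the 
branch: the make method, the tuple representations, and the density threshold 
used by SmartVectorFactory here are all assumptions chosen for illustration, 
since the description above does not say how the factory decides.

```python
from abc import ABC, abstractmethod


class VectorFactory(ABC):
    """Decides the representation of vectors created inside the algorithm."""

    @abstractmethod
    def make(self, size, entries):
        """Build a vector of length `size` from {index: value} entries."""


class DenseVectorFactory(VectorFactory):
    """Default behavior: always produce a dense vector."""

    def make(self, size, entries):
        values = [0.0] * size
        for i, v in entries.items():
            values[i] = v
        return ("dense", values)


class SmartVectorFactory(VectorFactory):
    """Hypothetical policy: go sparse when few enough entries are nonzero."""

    def __init__(self, max_density=0.5):
        self.max_density = max_density  # assumed threshold, not from the report

    def make(self, size, entries):
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if size > 0 and len(nonzero) / size <= self.max_density:
            indices = sorted(nonzero)
            return ("sparse", size, indices, [nonzero[i] for i in indices])
        return DenseVectorFactory().make(size, nonzero)


print(SmartVectorFactory().make(8, {3: 1.5}))
print(SmartVectorFactory().make(2, {0: 1.0, 1: 3.0}))
```

Keeping the dense factory as the default is what preserves the current 
behavior for callers that do not pass a factory.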

For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing its 
indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory and two implementations of it that 
are used within the KMeans code
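
Changes (1) and (2) can be illustrated with a Python sketch of axpy 
(y += a * x) over two sparse vectors. The SparseVec class and the merge-based 
implementation are assumptions for illustration and do not reproduce the 
branch's code. One plausible reason a reassign method is needed is that the 
update can change which indices of y are nonzero, so the indices and values of 
y have to be replaced together.

```python
class SparseVec:
    """Minimal sparse vector: sorted indices plus their values."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def reassign(self, indices, values):
        """Replace the stored indices and values in one step."""
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.indices = list(indices)
        self.values = list(values)


def axpy(a, x, y):
    """Compute y += a * x in place, merging the two sorted index lists."""
    if x.size != y.size:
        raise ValueError("vector sizes differ")
    out_i, out_v = [], []
    i = j = 0
    while i < len(x.indices) or j < len(y.indices):
        # Use the vector size as a sentinel once one side is exhausted.
        xi = x.indices[i] if i < len(x.indices) else x.size
        yj = y.indices[j] if j < len(y.indices) else y.size
        if xi == yj:
            idx, val = xi, a * x.values[i] + y.values[j]
            i += 1
            j += 1
        elif xi < yj:
            idx, val = xi, a * x.values[i]
            i += 1
        else:
            idx, val = yj, y.values[j]
            j += 1
        out_i.append(idx)
        out_v.append(val)
    y.reassign(out_i, out_v)


x = SparseVec(6, [1, 4], [1.0, 2.0])
y = SparseVec(6, [0, 4], [5.0, 1.0])
axpy(2.0, x, y)
print(y.indices, y.values)
```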


To get the solution described above, run:

git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
