Was this posted to the wrong project?

On Sun, Jan 17, 2016 at 2:58 PM, Roy Levin (JIRA) <[email protected]> wrote:

> Roy Levin created KYLIN-1326:
> --------------------------------
>
>              Summary: Changes to support KMeans with large feature space
>                  Key: KYLIN-1326
>                  URL: https://issues.apache.org/jira/browse/KYLIN-1326
>              Project: Kylin
>           Issue Type: Improvement
>           Components: Spark
>             Reporter: Roy Levin
>
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example the feature space has around 50000
> dimensions and k ~= 500. That comes to around 200MB of RAM for the center
> vectors alone (50000 x 500 x 8 bytes per double), while in fact the center
> vectors are very sparse and need far less RAM.
> Since I am running on a system with relatively low resources, I keep
> getting OutOfMemory errors. In my setting it is OK to trade runtime for
> lower RAM usage. This is what I set out to do in my solution, while giving
> users the flexibility to choose.
>
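
The 200MB figure quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it counts raw 8-byte doubles and 4-byte int indices and ignores JVM object overhead, and the 500 nonzeros per center used for the sparse case is a made-up illustrative value, not a number from the report.

```python
# Rough memory estimate for k KMeans centers, dense vs. sparse.
# Assumptions (not from the report): 8-byte doubles, 4-byte int indices,
# no per-object overhead.
BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4

def dense_center_bytes(k, dim):
    """Every center stores one double per feature."""
    return k * dim * BYTES_PER_DOUBLE

def sparse_center_bytes(k, nnz_per_center):
    """Every center stores one (index, value) pair per nonzero entry."""
    return k * nnz_per_center * (BYTES_PER_INT + BYTES_PER_DOUBLE)

k, dim = 500, 50_000
print(dense_center_bytes(k, dim) / 1e6, "MB dense")

# Hypothetical sparsity: 500 nonzeros per center (1% of the features).
print(sparse_center_bytes(k, 500) / 1e6, "MB sparse")
```
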
> One solution could be to reduce the dimensionality of the feature space,
> but this is not always the best approach. For example, suppose the objects
> are users and the features are items, and we want to run kmeans over a
> feature space where feature j of user i is a function of how many times
> user i clicked item j. If we reduce the dimensionality of the items we
> will not be able to map the center vectors back to the items. Moreover, in
> a streaming context, detecting changes with respect to previous runs gets
> more difficult.
>
>
> My solution:
> ----------------
> Allow the kmeans algorithm to accept a VectorFactory which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (as is the case now). But the user can now provide a
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to
> make vectors sparse.
>
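
To make the factory idea concrete, here is a minimal sketch in Python rather than Scala. It is not the code from the linked branch. VectorFactory and SmartVectorFactory are names from the report, but the report does not say how SmartVectorFactory decides, so the density threshold below is purely an assumption; DenseVectorFactory, SparseVec, and max_density are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Union

class SparseVec(NamedTuple):
    """Hypothetical sparse layout: logical size plus parallel index/value lists."""
    size: int
    indices: List[int]
    values: List[float]

Vec = Union[List[float], SparseVec]

class VectorFactory:
    """Decides which representation a vector gets inside the algorithm."""
    def create(self, values: List[float]) -> Vec:
        raise NotImplementedError

class DenseVectorFactory(VectorFactory):
    """Backward-compatible default: always dense."""
    def create(self, values: List[float]) -> Vec:
        return list(values)

class SmartVectorFactory(VectorFactory):
    """Goes sparse when few entries are nonzero (assumed rule, not from the report)."""
    def __init__(self, max_density: float = 0.25):
        self.max_density = max_density

    def create(self, values: List[float]) -> Vec:
        nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
        if values and len(nonzero) / len(values) <= self.max_density:
            return SparseVec(len(values),
                             [i for i, _ in nonzero],
                             [v for _, v in nonzero])
        return list(values)

factory = SmartVectorFactory()
print(factory.create([0.0, 0.0, 0.0, 2.0, 0.0]))  # 1 of 5 nonzero: sparse
print(factory.create([1.0, 2.0, 0.0]))            # 2 of 3 nonzero: dense
```
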
> For this I made the following changes:
> (1) Added a method called reassign to SparseVector that allows changing
> its indices and values
> (2) Allowed axpy to accept SparseVectors
> (3) Created a trait called VectorFactory and two implementations of it
> that are used within the KMeans code
>
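
For changes (1) and (2), one plausible reading is that axpy (y := a*x + y) with a sparse y has to produce a new set of indices and values, which reassign would then store back into y. The Python sketch below shows only that merge step, assuming both vectors keep their indices sorted; sparse_axpy is a hypothetical helper, not a function from Spark or from the linked branch.

```python
def sparse_axpy(a, x_idx, x_val, y_idx, y_val):
    """Return (indices, values) of a*x + y for sparse x and y.

    Both inputs are parallel index/value lists with strictly increasing indices.
    """
    out_idx, out_val = [], []
    i = j = 0
    while i < len(x_idx) or j < len(y_idx):
        if j == len(y_idx) or (i < len(x_idx) and x_idx[i] < y_idx[j]):
            # Index present only in x.
            out_idx.append(x_idx[i])
            out_val.append(a * x_val[i])
            i += 1
        elif i == len(x_idx) or y_idx[j] < x_idx[i]:
            # Index present only in y.
            out_idx.append(y_idx[j])
            out_val.append(y_val[j])
            j += 1
        else:
            # Index present in both vectors.
            out_idx.append(x_idx[i])
            out_val.append(a * x_val[i] + y_val[j])
            i += 1
            j += 1
    return out_idx, out_val

idx, val = sparse_axpy(2.0, [0, 3], [1.0, 1.0], [3, 5], [4.0, 1.0])
print(idx, val)
```

The result can have more nonzeros than y had, which is why an in-place update of a sparse vector would need something like reassign instead of writing into the existing arrays.
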
>
> To get the solution described above, run:
>
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
