[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186360#comment-14186360 ]

Xiangrui Meng commented on SPARK-4039:
--------------------------------------

Before we return the task result, we could check whether it is worth compressing 
it into sparse vectors, but that extra complexity may be unnecessary. If you use 
HashingTF, you can reduce the number of features, and the data fed to k-means 
will then be denser. There is a trade-off between computation/storage and 
quality: try different numbers of features and see how that trade-off plays out 
for your dataset.
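
For illustration, a minimal sketch of that tuning knob, assuming a running 
SparkContext and an RDD "docs" of tokenized documents (both hypothetical names 
here, not from the issue):

{code:scala}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.clustering.KMeans

// Assumed setup: `docs` is an RDD[Seq[String]] of tokenized documents.
// Shrinking numFeatures from HashingTF's default (1 << 20) makes the hashed
// vectors denser, trading some hash-collision quality for memory.
val numFeatures = 1 << 14
val hashingTF = new HashingTF(numFeatures)
val tfVectors = hashingTF.transform(docs).cache()

// Small k, as in the report; each (densified) center is now only
// numFeatures doubles wide instead of 1 << 20.
val model = KMeans.train(tfVectors, 10, 20)
{code}

Re-running this with a few feature counts (e.g. 1 << 14, 1 << 16, 1 << 18) and 
comparing the resulting clustering quality is one way to pick a point on that 
trade-off.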

> KMeans support HashingTF vectors
> --------------------------------
>
>                 Key: SPARK-4039
>                 URL: https://issues.apache.org/jira/browse/SPARK-4039
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Antoine Amend
>
> When the number of features is not known in advance, it can be quite helpful 
> to create sparse vectors using HashingTF.transform. KMeans transforms the 
> center vectors to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
> therefore leading to OutOfMemoryError (even with small k).
> Is there any way to keep the vectors sparse?
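
For a rough sense of scale (assuming HashingTF's default of 1 << 20 features): a 
single densified center then holds about a million doubles, roughly 8 MB, and 
the intermediate per-partition sums in each task result are dense vectors of the 
same width, so even a small k adds up quickly.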


