[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186360#comment-14186360 ]
Xiangrui Meng commented on SPARK-4039:
--------------------------------------

Before we return the task result, we could check whether it is worth compressing it into sparse vectors, but that extra complexity may be unnecessary. If you use HashingTF, you can reduce the number of features, which makes the data fed to k-means denser. There is a trade-off between computation/storage and quality: try different numbers of features and measure the trade-off on your dataset.

> KMeans support HashingTF vectors
> --------------------------------
>
>                 Key: SPARK-4039
>                 URL: https://issues.apache.org/jira/browse/SPARK-4039
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create
> sparse vectors using HashingTF.transform. KMeans converts center vectors
> to dense vectors
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
> therefore leading to OutOfMemory errors (even with small k).
> Any way to keep the vectors sparse?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
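To illustrate the trade-off described in the comment above, here is a minimal, hypothetical sketch of the hashing trick that underlies HashingTF, written in plain Python rather than Spark: with many hash buckets the term-frequency vector is very sparse, while shrinking the number of features packs the same counts into fewer buckets (at the cost of hash collisions), so the vectors handed to k-means become denser. The function names here (`hashing_tf`, `density`) are illustrative, not part of any Spark API.

```python
from collections import Counter

def hashing_tf(tokens, num_features):
    # Map each token to a bucket index via the hashing trick and count
    # occurrences: a simplified analogue of HashingTF.transform.
    # Note: Python randomizes str hashes per process, so bucket indices
    # differ across runs, but densities behave the same way.
    counts = Counter(hash(t) % num_features for t in tokens)
    vec = [0] * num_features
    for idx, c in counts.items():
        vec[idx] = c
    return vec

def density(vec):
    # Fraction of non-zero entries.
    return sum(1 for x in vec if x != 0) / len(vec)

tokens = "the quick brown fox jumps over the lazy dog".split()
wide = hashing_tf(tokens, 1 << 10)  # many features: very sparse vector
narrow = hashing_tf(tokens, 8)      # few features: denser, collisions likely
```

Fewer features means less memory when KMeans densifies the centers, but collisions merge unrelated terms and can hurt clustering quality, which is why trying several feature counts on your own data is the practical way to pick a value.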