By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a roughly million-dimensional space. The current Spark clusterer converts
each sparse vector into a dense vector with a million entries when it is added
to a cluster. Hence, the memory needed grows as the number of clusters times
roughly 8 MB (one 8-byte double per dimension).
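
To make the arithmetic concrete, here is a small back-of-the-envelope sketch
(plain Scala, not Spark internals; the 2^20 figure is HashingTF's default
numFeatures):

    // Back-of-the-envelope: memory for k dense cluster centers at HashingTF's default width.
    val dims = 1 << 20                        // 2^20 = 1,048,576 features (HashingTF default)
    val bytesPerCenter = dims * 8L            // one 8-byte Double per dimension, about 8 MB
    def centerMemoryBytes(k: Int): Long = k.toLong * bytesPerCenter
    // e.g. k = 1000 clusters => roughly 8 GB just to hold the dense centers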

You should try my new generalized k-means clustering package
<https://github.com/derrickburns/generalized-kmeans-clustering>, which
works on high-dimensional sparse data.

You will want to use the RandomIndexing embedding:

// Cluster sparse vectors via a low-dimensional random-indexing embedding.
def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
  KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
}
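
For example, a minimal end-to-end sketch might look like the following
(assuming tokenized documents in a hypothetical docs: RDD[Seq[String]], and
that the package's KMeans accepts MLlib sparse vectors):

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val tf = new HashingTF()                  // default numFeatures = 2^20, sparse output
    val vectors: RDD[Vector] = docs.map(doc => tf.transform(doc))
    val model = sparseTrain(vectors, k = 100) // random indexing keeps the centers low dimensional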


