Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick. When I count the unique terms, the number is very small, so I added
this:

val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2))
  .distinct().count().toInt
val hashingTF = new HashingTF(tfidf_features)
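A minimal sketch of how the rest of the pipeline from the original post picks up the
smaller feature space (assuming msgs, IDF, and Normalizer as defined there); with this
change each dense cluster center holds tfidf_features doubles instead of 2^20:

// Sketch only: rebuild TF-IDF and L2 normalization on top of the smaller HashingTF,
// reusing msgs from the original post.
val tf = hashingTF.transform(msgs)
val tfidf = new IDF().fit(tf).transform(tf)
val norm = new Normalizer()
val data = tfidf.map(v => norm.transform(v))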





Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
By default, HashingTF turns each document into a sparse vector in R^(2^20), i.e. a
roughly one-million-dimensional space. The current Spark clusterer turns each sparse
vector into a dense vector with a million entries when it is added to a cluster.
Hence, the memory needed grows as the number of clusters times 8 MB (2^20 entries at
8 bytes per double).
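
To make that concrete, here is a back-of-the-envelope estimate for the 200-cluster run
from the original post; this is just the 8 bytes per double times 2^20 dimensions
described above, not a measurement:

// Rough estimate only: memory for k dense cluster centers at HashingTF's
// default dimensionality of 2^20.
val dims = 1 << 20                  // default HashingTF feature dimension
val bytesPerCenter = dims * 8L      // ~8 MB per dense center (8 bytes per double)
val k = 200
val totalBytes = k * bytesPerCenter // ~1.7 GB just for the cluster centers
println(f"${totalBytes / 1e9}%.1f GB for $k dense centers")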

You should try my new generalized k-means clustering package
(https://github.com/derrickburns/generalized-kmeans-clustering), which works on
high-dimensional sparse data.

You will want to use the RandomIndexing embedding:

def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
  KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
}
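
For example, assuming data is the L2-normalized TF-IDF RDD[Vector] from the original
post and the package's KMeans and KMeansModel are imported in place of MLlib's:

// Hypothetical usage of the sparseTrain helper above on the thread's data.
val model = sparseTrain(data, 200)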






KMeans with large clusters Java Heap Space

2015-01-29 Thread mvsundaresan
I'm trying to cluster small text messages using HashingTF and IDF with L2
normalization. The data looks like this:

id, msg
1, some text1
2, some more text2
3, sample text 3

The input data file is 1.7 MB with 10K rows. It runs (very slowly; it took 3 hours)
for up to 20 clusters, but when I ask for 200 clusters I get a Java Heap Space error.
I'm working with a 3-node cluster where each node has 8 GB of memory and 2 cores.
I've played with different configurations, but no luck...

What am I missing? Any suggestions?

Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sparkConf = new SparkConf().setMaster("spark://master:7077")
  .setAppName("SparkKMeans")
  .set("spark.executor.memory", "4192m")
  .set("spark.storageLevel", "MEMORY_AND_DISK")
  .set("spark.driver.memory", "4192m")
  .set("spark.default.parallelism", "200")
  .set("spark.storage.blockManagerHeartBeatMs", "6")
  .set("spark.akka.frameSize", "1000")

implicit val sc = new SparkContext(sparkConf)

val numClusters = 200
val numIterations = 20

// Read the raw file and keep the id and message columns.
val file = sc.textFile(".../file10k")
val lines = file.map(_.split("\001")).map(x => (x(1).toString, x(15).toString))

// Tokenize: strip non-alphanumerics, lowercase, split on spaces.
val msgs = lines.map { case (val1, val2) =>
  val2.replaceAll("[^a-zA-Z0-9]", " ").toLowerCase.split(" ").toSeq
}

// TF-IDF features with L2 normalization.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(msgs)
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val l2normalizer = new Normalizer()
val data = tfidf.map(x => l2normalizer.transform(x))

val clusters = KMeans.train(data, numClusters, numIterations)

val WSSSE = clusters.computeCost(data)
val centroids = clusters.clusterCenters.map(_.toArray)

// Join the cluster assignments back to the original records by index.
// (tickets, an RDD of pairs, is defined elsewhere in the original job.)
val result = clusters.predict(data)
val srcidx = result.zipWithIndex().map { case (val1, val2) => (val2, val1) }
val tktidx = tickets.zipWithIndex().map { case ((val1, val2), val3) =>
  (val3, (val1, val2))
}
val joined = srcidx.join(tktidx).map { case (val1, (val2, (val3, val4))) =>
  (val1, val2, val3, val4)
}
joined.saveAsTextFile(".../clustersoutput.txt")


