Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick. When I count the unique terms, the count is quite small, so I added
this:

// Size the hash space to the number of distinct terms longer than 2 characters.
val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2))
  .distinct().count().toInt
val hashingTF = new HashingTF(tfidf_features)
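
[Editor's note: one hedged follow-up, not from the thread. The HashingTF scaladoc
advises a power of two for the feature dimension so terms map evenly onto the
columns, so it may be worth rounding the distinct-term count up. A minimal sketch,
reusing tfidf_features from above:]

// Round the distinct-term count up to the next power of two, per the
// HashingTF scaladoc's recommendation for an even hash distribution.
var numFeatures = 1
while (numFeatures < tfidf_features) numFeatures <<= 1
val hashingTF = new HashingTF(numFeatures)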






Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans. After training the
model, I want to get the top terms (5-10) for each cluster. How do I get
those using clusterCenters?

The full code is here:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
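
[Editor's note: a hedged sketch of one way to do it, assuming the msgs,
hashingTF, and clusters values from the linked code. Hashing is one-way, so
first build a map from hash index back to the corpus terms that land there,
then take the heaviest dimensions of each center:]

// Map each hash index back to the corpus terms that hash to it.
val indexToTerms: Map[Int, Seq[String]] =
  msgs.flatMap(identity).distinct().collect()
    .groupBy(term => hashingTF.indexOf(term))
    .mapValues(_.toSeq).toMap

// For each cluster center, take the 10 largest tf-idf weights and look up
// the terms behind those dimensions (hash collisions can yield more than
// one term per dimension).
val topTerms = clusters.clusterCenters.map { center =>
  center.toArray.zipWithIndex
    .sortBy { case (weight, _) => -weight }
    .take(10)
    .flatMap { case (_, idx) => indexToTerms.getOrElse(idx, Seq.empty) }
}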
 






KMeans with large clusters Java Heap Space

2015-01-29 Thread mvsundaresan
I'm trying to cluster short text messages using HashingTF and IDF with L2
normalization. The data looks like this:

id, msg
1, some text1
2, some more text2
3, sample text 3

The input file is 1.7 MB with 10 K rows. It runs (very slowly; about 3 hours)
for up to 20 clusters, but when I ask for 200 clusters I get a Java heap space
error. I'm working on a 3-node cluster with 8 GB of memory and 2 cores per
node. I've played with different configurations, but no luck.

What am I missing? Any suggestions?

Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sparkConf = new SparkConf().setMaster("spark://master:7077")
  .setAppName("SparkKMeans")
  .set("spark.executor.memory", "4192m")
  .set("spark.storageLevel", "MEMORY_AND_DISK")
  .set("spark.driver.memory", "4192m")
  .set("spark.default.parallelism", "200")
  .set("spark.storage.blockManagerHeartBeatMs", "6")
  .set("spark.akka.frameSize", "1000")

implicit val sc = new SparkContext(sparkConf)

val numClusters = 200
val numIterations = 20

// Fields 1 and 15 of the \001-delimited file are the id and the message text.
val file = sc.textFile(".../file10k")
val lines = file.map(_.split("\001")).map(x => (x(1).toString, x(15).toString))

// Tokenize: strip non-alphanumerics, lowercase, split on spaces.
val msgs = lines.map { case (id, msg) =>
  msg.replaceAll("[^a-zA-Z0-9]", " ").toLowerCase.split(" ").toSeq
}

// tf-idf features with L2 normalization.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(msgs)
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val l2normalizer = new Normalizer()
val data = tfidf.map(x => l2normalizer.transform(x))

val clusters = KMeans.train(data, numClusters, numIterations)

val WSSSE = clusters.computeCost(data)
val centroids = clusters.clusterCenters map (_.toArray)

// Join each (id, msg) pair with its cluster assignment by row index.
val result = clusters.predict(data)
val srcidx = result.zipWithIndex().map { case (cluster, idx) => (idx, cluster) }
val tktidx = lines.zipWithIndex().map { case ((id, msg), idx) => (idx, (id, msg)) }
val joined = srcidx.join(tktidx).map { case (idx, (cluster, (id, msg))) =>
  (idx, cluster, id, msg)
}
joined.saveAsTextFile(".../clustersoutput.txt")
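
[Editor's note: a hedged aside on the likely cause. new HashingTF() defaults
to 2^20 features, and KMeans keeps k dense centers of that width, which alone
is on the order of the executor heap here. A rough sketch of the arithmetic,
with 2^15 as an assumed (not prescribed) cap:]

// new HashingTF() defaults to 2^20 = 1,048,576 dimensions.
val defaultFeatures = 1 << 20
// 200 dense centers of doubles: 200 * 2^20 * 8 bytes ~ 1.7 GB,
// which does not fit comfortably in a 4 GB executor heap.
val centerBytes = 200L * defaultFeatures * 8
// Capping the dimensionality (2^15 here is an assumed value) shrinks
// the centers by a factor of 32:
val hashingTF = new HashingTF(1 << 15)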


