I am trying to cluster small text messages using HashingTF and IDF with L2 normalization. The data looks like this:
    id, msg
    1, some text1
    2, some more text2
    3, sample text 3

The input file is 1.7 MB with 10K rows. The job runs (very slowly, about 3 hours) for up to 20 clusters, but when I ask for 200 clusters I get a Java heap space error. I am working with a 3-node cluster, each node with 8 GB of memory and 2 cores. I have played with different configurations, but no luck. What am I missing? Any suggestions?

Here is my code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sparkConf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("SparkKMeans")
      .set("spark.executor.memory", "4192m")
      .set("spark.storageLevel", "MEMORY_AND_DISK")
      .set("spark.driver.memory", "4192m")
      .set("spark.default.parallelism", "200")
      .set("spark.storage.blockManagerHeartBeatMs", "60000")
      .set("spark.akka.frameSize", "1000")
    implicit val sc = new SparkContext(sparkConf)

    val numClusters = 200
    val numIterations = 20

    val file = sc.textFile(".../file10k")
    val lines = file.map(_.split("\001")).map(x => (x(1).toString, x(15).toString))
    val msgs = lines.map { case (val1, val2) =>
      val2.replaceAll("[^a-zA-Z0-9]", " ").toLowerCase.split(" ").toSeq
    }

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(msgs)
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)
    val l2normalizer = new Normalizer()
    val data = tfidf.map(x => l2normalizer.transform(x))

    val clusters = KMeans.train(data, numClusters, numIterations)
    val WSSSE = clusters.computeCost(data)
    val centroids = clusters.clusterCenters.map(_.toArray)

    val result = clusters.predict(data)
    val srcidx = result.zipWithIndex().map { case (val1, val2) => (val2, val1) }
    val tktidx = lines.zipWithIndex().map { case ((val1, val2), val3) => (val3, (val1, val2)) }
    val joined = srcidx.join(tktidx).map { case (val1, (val2, (val3, val4))) => (val1, val2, val3, val4) }
    joined.saveAsTextFile(".../clustersoutput.txt")

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-tp21432.html
Sent from the Apache Spark User List mailing list
archive at Nabble.com.
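A likely contributor to the heap error, assuming MLlib defaults: `new HashingTF()` hashes into 2^20 features, and KMeans materializes each of the k cluster centers as a dense vector of that dimension (plus extra candidate centers during k-means|| initialization), so memory grows with k × numFeatures rather than with the 1.7 MB input. A back-of-the-envelope check in plain Scala (no Spark required; the numbers are illustrative, not measured from this job):

```scala
object CenterMemoryEstimate {
  // MLlib's HashingTF defaults to 2^20 = 1,048,576 feature buckets.
  val defaultNumFeatures: Long = 1L << 20

  // Approximate bytes needed to hold k dense double-precision centers
  // of dimensionality dim (8 bytes per double, ignoring object overhead).
  def centerBytes(k: Long, dim: Long): Long = k * dim * 8L

  def main(args: Array[String]): Unit = {
    // 200 clusters at the default dimensionality: roughly 1.6 GB of
    // dense centers, before k-means|| oversampling is even counted.
    val gb = centerBytes(200L, defaultNumFeatures).toDouble / (1L << 30)
    println(f"200 centers x 2^20 dims ~= $gb%.2f GB of dense doubles")

    // Hashing into fewer buckets, e.g. new HashingTF(1 << 16), cuts
    // this by 16x; for 10K short messages that is usually plenty.
    val mb = centerBytes(200L, 1L << 16).toDouble / (1L << 20)
    println(f"200 centers x 2^16 dims ~= $mb%.0f MB")
  }
}
```

Given that, passing a smaller dimensionality to the existing `HashingTF(numFeatures: Int)` constructor (for example `new HashingTF(1 << 16)`) is worth trying before raising executor memory further; 20 clusters fitting while 200 fail is consistent with center storage, not the input data, being the bottleneck.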