Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and tf-idf RDD[Vector]s using the code below. But how do I get from these back to the words, with their counts and tf-idf values, for presentation?

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
    // getString(0) reads the text column; Row.toString would add surrounding brackets
    val documents: RDD[Seq[String]] = sentsTmp.map(_.getString(0).split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

It looks like I can get the indices of the terms using something like

    J = wordListRDD.map(w => hashingTF.indexOf(w))

where wordListRDD is an RDD holding the distinct words from the sequences of words used to build tf. But how do I do the equivalent of

    Counts = J.map(j => tf.counts(j))

?

Thanks,
Ron
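
For reference, one possible way to do the lookup is sketched below. It is not a confirmed answer: it assumes that tf and tfidf were derived from documents by the element-wise transforms shown above, so the three RDDs line up partition by partition and can be zipped. The helper name wordScores is hypothetical. Because HashingTF hashes words into buckets, two words that collide share an index, so the reported values may combine several words.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: for each document, pair every distinct word with
// its term count (from tf) and its tf-idf weight (from tfidf).
// Assumption: tf and tfidf come from `documents` via element-wise transforms,
// so zip sees the same number of partitions and elements in each RDD.
def wordScores(
    documents: RDD[Seq[String]],
    tf: RDD[Vector],
    tfidf: RDD[Vector],
    hashingTF: HashingTF): RDD[Seq[(String, Double, Double)]] =
  documents.zip(tf).zip(tfidf).map { case ((words, counts), weights) =>
    words.distinct.map { w =>
      val j = hashingTF.indexOf(w) // same bucket index used when tf was built
      (w, counts(j), weights(j))   // colliding words share a bucket
    }
  }
```

With the values from the post, wordScores(documents, tf, tfidf, hashingTF).take(5) would return the first few documents as sequences of (word, count, tf-idf) triples.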