I'm following the MLlib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and Spark. Any help would be greatly appreciated.
Following the MLlib example I could do something like this:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.feature.IDF

    val sc: SparkContext = ...
    val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input documents. My question is: how do I map each vector back to the original input document?

My end goal is to compute document similarity using cosine similarity. From what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute the dot product. Has anybody done this?

Currently, my example looks more like this:

    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.feature.IDF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext

    val sc: SparkContext = ...

    // input is a sequence file of the form (docid: Text, content: Text)
    val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
    val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain the link from each document identifier to its eventual vector representation. Am I going about this incorrectly?

Thanks
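One idea I had, picking up from the `docs` RDD above, is to hash each document individually, since HashingTF.transform also has an overload that takes a single Iterable, and then apply the IDF model per value with mapValues so the doc id never gets detached. This is only a sketch; I'm assuming IDFModel.transform accepts a single Vector, which I believe it does in newer Spark versions but haven't verified:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val hashingTF = new HashingTF()

    // hash each document on its own so the doc id stays attached
    val tf: RDD[(String, Vector)] = docs.mapValues(doc => hashingTF.transform(doc))
    tf.cache()

    // fit the IDF model on the vectors alone, then rescale each vector per key
    val idf = new IDF().fit(tf.values)
    val tfidf: RDD[(String, Vector)] = tf.mapValues(v => idf.transform(v))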
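For the cosine similarity itself, my rough plan (untested) is to L2-normalize the TF-IDF vectors with Normalizer, so that the dot product of two normalized vectors is exactly their cosine similarity, and then build all pairs with cartesian. The dot helper below is my own, and I'm assuming the hashed vectors come back as SparseVectors:

    import org.apache.spark.mllib.feature.Normalizer
    import org.apache.spark.mllib.linalg.{SparseVector, Vector}
    import org.apache.spark.rdd.RDD

    // Normalizer defaults to the L2 norm, so dot(a, b) == cosine(a, b) afterwards
    val normalizer = new Normalizer()
    val normalized: RDD[(String, Vector)] = tfidf.mapValues(v => normalizer.transform(v))

    // dot product of two sparse vectors by walking their sorted index arrays
    def dot(a: SparseVector, b: SparseVector): Double = {
      var i = 0; var j = 0; var sum = 0.0
      while (i < a.indices.length && j < b.indices.length) {
        if (a.indices(i) == b.indices(j)) { sum += a.values(i) * b.values(j); i += 1; j += 1 }
        else if (a.indices(i) < b.indices(j)) i += 1
        else j += 1
      }
      sum
    }

    // all unordered document pairs; id1 < id2 drops self-pairs and duplicates
    val similarities: RDD[((String, String), Double)] =
      normalized.cartesian(normalized)
        .filter { case ((id1, _), (id2, _)) => id1 < id2 }
        .map { case ((id1, v1: SparseVector), (id2, v2: SparseVector)) =>
          ((id1, id2), dot(v1, v2))
        }

I realize the cartesian step is quadratic in the number of documents, so this would only be workable for a fairly small corpus. Is there a better way?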