You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID and join the results back after training. HashingTF can transform individual records:
val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
...

Best,
Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster <burke.webs...@gmail.com> wrote:
> I'm following the MLlib example for TF-IDF and ran into a problem due to my
> lack of knowledge of Scala and Spark. Any help would be greatly appreciated.
>
> Following the MLlib example I could do something like this:
>
> import org.apache.spark.rdd.RDD
> import org.apache.spark.SparkContext
> import org.apache.spark.mllib.feature.HashingTF
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.mllib.feature.IDF
>
> val sc: SparkContext = ...
> val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> tf.cache()
>
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
>
> As a result I would have an RDD containing the TF-IDF vectors for the input
> documents. My question is: how do I map each vector back to its original
> input document?
>
> My end goal is to compute document similarity using cosine similarity. From
> what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute
> the dot product. Has anybody done this?
>
> Currently, my example looks more like this:
>
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> import org.apache.spark.mllib.feature.HashingTF
> import org.apache.spark.mllib.feature.IDF
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
> import org.apache.spark.SparkContext
>
> val sc: SparkContext = ...
>
> // input is a sequence file of the form (docid: Text, content: Text)
> val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
>
> val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
>
> val hashingTF = new HashingTF()
> val tf: RDD[(String, Vector)] = hashingTF.??
>
> I'm trying to maintain a link from each document identifier to its
> eventual vector representation. Am I going about this incorrectly?
>
> Thanks
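
Putting the reply together with the original question's end goal (cosine similarity), a rough end-to-end sketch is below. It assumes Spark 1.x MLlib (HashingTF, IDF, Normalizer); the zip-based join-back, the dot helper, and the cartesian all-pairs step are illustrative choices, not the only (or most scalable) way to do this.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

// (docid, content) pairs, as in the original question
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)

// Term frequencies, keyed by document id
val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = docs.mapValues(hashingTF.transform)
tf.cache()

// Fit IDF on the vectors alone, then zip the ids back on.
// keys and values come from the same cached RDD and IDFModel.transform is a
// row-by-row map, so the two sides line up element for element.
val idfModel = new IDF().fit(tf.values)
val tfidf: RDD[(String, Vector)] = tf.keys.zip(idfModel.transform(tf.values))

// L2-normalize so that the dot product of two rows is their cosine similarity
val normalizer = new Normalizer()  // p = 2 by default
val normalized: RDD[(String, Vector)] = tfidf.mapValues(v => normalizer.transform(v))

// Illustrative sparse dot product (HashingTF/IDF produce sparse vectors)
def dot(a: Vector, b: Vector): Double = (a, b) match {
  case (x: SparseVector, y: SparseVector) =>
    val lookup = y.indices.zip(y.values).toMap
    x.indices.zip(x.values).map { case (i, v) => v * lookup.getOrElse(i, 0.0) }.sum
  case _ =>
    a.toArray.zip(b.toArray).map { case (p, q) => p * q }.sum
}

// All-pairs cosine similarity; quadratic in the number of documents,
// so only reasonable for modest corpora
val similarities: RDD[((String, String), Double)] =
  normalized.cartesian(normalized)
    .filter { case ((id1, _), (id2, _)) => id1 < id2 }
    .map { case ((id1, v1), (id2, v2)) => ((id1, id2), dot(v1, v2)) }

Normalizing once up front means each pairwise comparison is just a dot product, rather than dividing by both norms for every pair; for a large corpus you would still want to restrict the comparison to candidate pairs instead of the full cartesian product.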