I'm following the MLlib example for TF-IDF and ran into a problem due to my
lack of knowledge of Scala and Spark.  Any help would be greatly
appreciated.

Following the MLlib example, I can do something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

val sc: SparkContext = ...
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input
documents.  My question is how do I map the vector back to the original
input document?

My end goal is to compute document similarity using cosine similarity.
From what I can tell, I can compute TF-IDF, apply the L2 norm, and then
compute the dot-product.  Has anybody done this?
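
To make that concrete, this is roughly what I had in mind for the similarity
step, building on the tfidf RDD from the snippet above.  The Normalizer and
the naive dot helper are just my own guesses at how to do it, not something
taken from the example:

import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vector

// L2-normalize the TF-IDF vectors; for unit vectors the dot product
// is exactly the cosine similarity
val normalizer = new Normalizer()  // p = 2 by default
val normalized: RDD[Vector] = normalizer.transform(tfidf)

// naive dot product over dense arrays -- fine as a sketch, though a
// sparse implementation would scale better with a large feature space
def dot(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum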

Currently, my example looks more like this:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

val sc: SparkContext = ...

// input is sequence file of the form (docid: Text, content: Text)
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")

val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain some linking from the document identifier to its
eventual vector representation.  Am I going about this incorrectly?
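
The closest I've gotten is the sketch below: transform each document's token
sequence on its own so the id stays attached.  I'm not sure this is the
intended way to use HashingTF and IDF, and I'm assuming IDFModel can
transform a single Vector (if not, I suppose the keys could be re-attached
with a zip or join instead):

// best guess so far: transform each document's tokens individually,
// so the document id is carried along in the pair
val tf: RDD[(String, Vector)] = docs.mapValues(tokens => hashingTF.transform(tokens))
tf.cache()

// fit the IDF model on just the vectors, then apply it per document
// (assumes IDFModel has a single-Vector transform)
val idf = new IDF().fit(tf.values)
val tfidf: RDD[(String, Vector)] = tf.mapValues(v => idf.transform(v))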

Thanks
