You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID and join the results back after training. HashingTF can transform individual records:
val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
...

Best,
Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster <burke.webs...@gmail.com> wrote:
> I'm following the MLlib example for TF-IDF and ran into a problem due to my
> lack of knowledge of Scala and Spark. Any help would be greatly appreciated.
>
> Following the MLlib example I could do something like this:
>
> import org.apache.spark.rdd.RDD
> import org.apache.spark.SparkContext
> import org.apache.spark.mllib.feature.HashingTF
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.mllib.feature.IDF
>
> val sc: SparkContext = ...
> val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> tf.cache()
>
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
>
> As a result I would have an RDD containing the TF-IDF vectors for the input
> documents. My question is: how do I map each vector back to its original
> input document?
>
> My end goal is to compute document similarity using cosine similarity. From
> what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute
> the dot product. Has anybody done this?
>
> Currently, my example looks more like this:
>
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> import org.apache.spark.mllib.feature.HashingTF
> import org.apache.spark.mllib.feature.IDF
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
> import org.apache.spark.SparkContext
>
> val sc: SparkContext = ...
>
> // input is a sequence file of the form (docid: Text, content: Text)
> val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
>
> val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
>
> val hashingTF = new HashingTF()
> val tf: RDD[(String, Vector)] = hashingTF.??
>
> I'm trying to maintain a link from each document identifier to its
> eventual vector representation. Am I going about this incorrectly?
>
> Thanks
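
Putting the reply together with the original question's end goal (cosine similarity), a rough end-to-end sketch is below. It assumes Spark 1.x MLlib (HashingTF, IDF, Normalizer); the zip-based join-back, the dot helper, and the cartesian all-pairs step are illustrative choices, not the only (or most scalable) way to do this.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

// (docid, content) pairs, as in the original question
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)

// Term frequencies, keyed by document id
val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = docs.mapValues(hashingTF.transform)
tf.cache()

// Fit IDF on the vectors alone, then zip the ids back on.
// keys and values come from the same cached RDD and IDFModel.transform is a
// row-by-row map, so the two sides line up element for element.
val idfModel = new IDF().fit(tf.values)
val tfidf: RDD[(String, Vector)] = tf.keys.zip(idfModel.transform(tf.values))

// L2-normalize so that the dot product of two rows is their cosine similarity
val normalizer = new Normalizer()  // p = 2 by default
val normalized: RDD[(String, Vector)] = tfidf.mapValues(v => normalizer.transform(v))

// Illustrative sparse dot product (HashingTF/IDF produce sparse vectors)
def dot(a: Vector, b: Vector): Double = (a, b) match {
  case (x: SparseVector, y: SparseVector) =>
    val lookup = y.indices.zip(y.values).toMap
    x.indices.zip(x.values).map { case (i, v) => v * lookup.getOrElse(i, 0.0) }.sum
  case _ =>
    a.toArray.zip(b.toArray).map { case (p, q) => p * q }.sum
}

// All-pairs cosine similarity; quadratic in the number of documents,
// so only reasonable for modest corpora
val similarities: RDD[((String, String), Double)] =
  normalized.cartesian(normalized)
    .filter { case ((id1, _), (id2, _)) => id1 < id2 }
    .map { case ((id1, v1), (id2, v2)) => ((id1, id2), dot(v1, v2)) }

Normalizing once up front means each pairwise comparison is just a dot product, rather than dividing by both norms for every pair; for a large corpus you would still want to restrict the comparison to candidate pairs instead of the full cartesian product.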