Product similarity with TF/IDF and Cosine similarity (DIMSUM)

Alan Prando Sat, 30 Jan 2016 13:30:14 -0800

Hi Folks!

I am trying to implement a Spark job to calculate the similarity between the 
products in my database, using only their names and descriptions.
I would like to use TF-IDF to represent my text data and cosine similarity to 
calculate all pairwise similarities.
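
For reference, here is a minimal sketch of cosine similarity between two sparse TF-IDF vectors, written in plain Scala with no Spark dependency. The `SparseVec` representation (a map from term index to weight) and the names `CosineSketch`, `dot`, `norm` and `cosine` are assumptions made for illustration; they are not part of MLlib.

```scala
object CosineSketch {
  // Hypothetical sparse vector: term index -> TF-IDF weight.
  type SparseVec = Map[Int, Double]

  // Dot product over the indices present in `a`; missing entries in `b` count as 0.
  def dot(a: SparseVec, b: SparseVec): Double =
    a.iterator.map { case (i, x) => x * b.getOrElse(i, 0.0) }.sum

  // Euclidean norm of a sparse vector.
  def norm(a: SparseVec): Double = math.sqrt(dot(a, a))

  // Cosine similarity; defined as 0.0 when either vector is all zeros.
  def cosine(a: SparseVec, b: SparseVec): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }
}
```

Two products that share no terms get a similarity of 0.0, and two products with proportional term weights get a similarity of 1.0.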


My goal is to get, once the job completes, all similarities as a list per product.
For example:
Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
Prod3 = ((Prod1, 0.88))
Prod4 = ((Prod2, 0.53))
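
A possible way to build such per-product lists from pairwise entries is sketched below in plain Scala. It assumes the input holds each pair only once, as `(i, j, similarity)` with `i < j`, which is the upper-triangular shape the MLlib documentation describes for the result of `columnSimilarities`. The object name `SimilarityLists`, the function `toLists` and the sample values are illustrative assumptions.

```scala
object SimilarityLists {
  // Each input entry is (i, j, similarity) and appears once per pair.
  // The output maps every id to its neighbours, most similar first.
  def toLists(entries: Seq[(Long, Long, Double)]): Map[Long, Seq[(Long, Double)]] =
    entries
      // Emit the pair in both directions so each id sees all of its neighbours.
      .flatMap { case (i, j, s) => Seq((i, (j, s)), (j, (i, s))) }
      .groupBy { case (id, _) => id }
      .map { case (id, pairs) =>
        id -> pairs.map { case (_, neighbour) => neighbour }.sortBy { case (_, s) => -s }
      }
}

object SimilarityListsDemo extends App {
  // Illustrative sample values only.
  val sample = Seq((1L, 2L, 0.98), (1L, 3L, 0.88), (2L, 4L, 0.53))
  SimilarityLists.toLists(sample).toSeq.sortBy(_._1).foreach(println)
}
```

On an RDD of entries the same idea should carry over with `flatMap` followed by `groupByKey`, though that variant is not shown or tested here.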

However, I am new to Spark and I am having trouble understanding what the 
cosine similarity computation returns!

My code:
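    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

    // One document per input line, split into space-separated tokens.
    val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)

    // Term frequencies via the hashing trick.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()

    // Ignore terms that appear in fewer than 2 documents.
    val idf = new IDF(minDocFreq = 2).fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

    val mat = new RowMatrix(tfidf)

    // Compute similar columns perfectly, with brute force.
    val exact = mat.columnSimilarities()

    // Compute similar columns with estimation using DIMSUM
    val approx = mat.columnSimilarities(0.1)

    val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
    val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }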

The file contains one product per row: its name and description.

The result I got:
    approxEntries.first()
    res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

How can I figure out which rows (products) this result refers to?
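
One point that may matter here: `columnSimilarities` compares columns, not rows. In the matrix above each row is a product and each column is a hashed term bucket (`HashingTF` defaults to 2^20 features), so a pair like (1638, 966248) would presumably identify two term buckets rather than two products. The plain-Scala sketch below illustrates the difference on a tiny dense matrix; `RowsVsColumns` and its functions are hypothetical helpers, not MLlib calls.

```scala
object RowsVsColumns {
  type Matrix = Vector[Vector[Double]]

  // Cosine similarity of two dense vectors of equal length; 0.0 if either is all zeros.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val denom = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Similarities between the COLUMNS of m, upper triangle only (i < j).
  def columnSimilarities(m: Matrix): Seq[((Int, Int), Double)] = {
    val cols = m.transpose
    for {
      i <- cols.indices
      j <- (i + 1) until cols.length
    } yield ((i, j), cosine(cols(i), cols(j)))
  }
}

object RowsVsColumnsDemo extends App {
  // 2 products (rows) x 3 terms (columns); the weights are made up.
  val productsByTerms = Vector(Vector(1.0, 1.0, 0.0), Vector(1.0, 0.0, 1.0))
  // Indices in this result are term indices.
  println(RowsVsColumns.columnSimilarities(productsByTerms))
  // After transposing, indices in the result are product indices.
  println(RowsVsColumns.columnSimilarities(productsByTerms.transpose))
}
```

In MLlib, one route worth checking against your Spark version is to attach row indices with `zipWithIndex`, build an `IndexedRowMatrix`, convert it with `toCoordinateMatrix()`, call `transpose()`, and then run `columnSimilarities` on the result of `toRowMatrix()`, so that the returned indices correspond to product rows. This is an untested suggestion rather than a confirmed recipe.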

Thanks in advance! =]


