Hi Folks! I am trying to implement a Spark job to calculate the similarity between the products in my database, using only their names and descriptions. I would like to use TF-IDF to represent the text and cosine similarity to compute all pairwise similarities.
My goal is, after the job completes, to get all similarities as a list. For example:

    Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
    Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
    Prod3 = ((Prod1, 0.98))
    Prod4 = ((Prod1, 0.53))

However, I am new to Spark and I am having trouble understanding what the cosine similarity call returns!

My code:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    import org.apache.spark.rdd.RDD

    // One product per line, tokenized on single spaces.
    val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()

    val idf = new IDF(minDocFreq = 2).fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

    val mat = new RowMatrix(tfidf)

    // Compute similar columns perfectly, with brute force.
    val exact = mat.columnSimilarities()

    // Compute similar columns with estimation using DIMSUM
    val approx = mat.columnSimilarities(0.1)

    val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
    val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }

The file contains just the product name and description, one product per row.

The result I got:

    approxEntries.first()
    res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

How can I figure out which rows (products) this result refers to?

Thanks in advance! =]
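
A note on reading those indices: `columnSimilarities()` compares the columns of the `RowMatrix`, so the `(i, j)` in each `MatrixEntry` are column indices. With rows built from `HashingTF` output, the columns are hashed term features, not products. The MLlib `HashingTF` default is 2^20 = 1,048,576 features, which would be consistent with an index like 966248. The sketch below is plain Scala with no Spark, run on a made-up matrix of 2 products by 3 terms, just to show the difference between the two axes. `SimilarityAxes`, `cosine`, `similarColumns`, `similarRows` and the toy values are illustrative assumptions, not part of MLlib or of the job above.

```scala
// Toy illustration only: rows are products, columns are term features.
// All names and values here are hypothetical, not MLlib API.
object SimilarityAxes {
  type Matrix = Vector[Vector[Double]]

  // 2 products x 3 term features (made-up weights).
  val toy: Matrix = Vector(
    Vector(1.0, 0.0, 1.0), // product 0
    Vector(1.0, 1.0, 0.0)  // product 1
  )

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Non-zero similarities between COLUMNS, keyed by column indices (i < j).
  def similarColumns(m: Matrix): Seq[((Int, Int), Double)] = {
    val cols = m.transpose
    for {
      i <- cols.indices
      j <- cols.indices if i < j
      sim = cosine(cols(i), cols(j))
      if sim != 0.0
    } yield ((i, j), sim)
  }

  // Similarities between ROWS: transpose first, so each row becomes a column.
  def similarRows(m: Matrix): Seq[((Int, Int), Double)] =
    similarColumns(m.transpose)

  def main(args: Array[String]): Unit = {
    println(s"column (term) pairs: ${similarColumns(toy)}")
    println(s"row (product) pairs: ${similarRows(toy)}")
  }
}
```

Here the column-wise result mentions index 2 even though there are only two products, while the row-wise result contains the single product pair (0, 1) with a similarity of about 0.5. In Spark, one possible way to get the same effect is to keep an explicit index per product (for example with `zipWithIndex` and an `IndexedRowMatrix`), convert to a `CoordinateMatrix`, and call `transpose()` before `columnSimilarities()`, so that each column corresponds to a product. Whether that scales for a given catalogue size is something to check on real data.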