Re: Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-02-03 Thread Karl Higley
Hi Alan,

I'm slow responding, so you may have already figured this out. Just in
case, though:

val approx = mat.columnSimilarities(0.1)
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

The above is returning the cosine similarity between columns 1638 and
966248. Since you're providing documents as rows, this is conceptually
something like the similarity between terms based on which documents they
occur in.
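
To make the column-versus-row distinction concrete, here is a minimal sketch in plain Scala (no Spark), using a toy 3x3 matrix with made-up weights rather than real TF-IDF values. The object name, the `cosine` helper, and the data are all illustrative assumptions. The indices in the result above, (1638, 966248), are column indices of this kind: they identify term features produced by HashingTF, not documents.

```scala
object ColumnVsRowSimilarity {
  // Cosine similarity between two equal-length vectors.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Rows are documents, columns are terms (toy weights, assumed for illustration).
  val docsByTerms: Seq[Seq[Double]] = Seq(
    Seq(1.0, 0.0, 2.0), // document 0
    Seq(0.0, 3.0, 0.0), // document 1
    Seq(1.0, 0.0, 2.0)  // document 2
  )

  // After transposing, each inner Seq holds one term's weights across all documents.
  val termsByDocs: Seq[Seq[Double]] = docsByTerms.transpose

  // Comparing columns of docsByTerms compares terms:
  // terms 0 and 2 occur in exactly the same documents, so this is close to 1.
  val termSim: Double = cosine(termsByDocs(0), termsByDocs(2))

  // Comparing rows of docsByTerms compares documents:
  // documents 0 and 1 share no terms, so this is 0.
  val docSim: Double = cosine(docsByTerms(0), docsByTerms(1))
}
```

Calling columnSimilarities() on a documents-as-rows matrix corresponds to the `termSim` style of comparison; the product-to-product similarity you want corresponds to the `docSim` style.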

In order to get the similarity between documents based on the terms they
contain, you'd need to build a RowMatrix where each row represents one term
and each column represents one document. One way to do that would be to
construct a CoordinateMatrix from your vectors, call transpose() on it,
then convert it to a RowMatrix via toRowMatrix().
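
A rough sketch of that recipe, assuming `tfidf` is the RDD[Vector] from your code and a Spark version where Vector.toSparse and CoordinateMatrix.transpose() are available. The object and function names (`DocumentSimilarity`, `documentSimilarities`, `similarityLists`) are hypothetical, and document ids are assigned with zipWithIndex(), which is one option among several. I haven't run this against your data, so treat it as a starting point:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

object DocumentSimilarity {
  // Hypothetical helper: cosine similarity between documents (not terms).
  // `tfidf` holds one TF-IDF vector per document; `threshold` is the DIMSUM
  // threshold passed to columnSimilarities (0.0 requests the exact computation).
  def documentSimilarities(tfidf: RDD[Vector], threshold: Double): RDD[((Long, Long), Double)] = {
    // Give every document an index, then emit one entry per non-zero term weight,
    // laid out as (row = document, column = term).
    val entries: RDD[MatrixEntry] = tfidf.zipWithIndex().flatMap { case (vec, docId) =>
      val sparse = vec.toSparse
      sparse.indices.zip(sparse.values).map { case (termId, weight) =>
        MatrixEntry(docId, termId.toLong, weight)
      }
    }
    val docsByTerms = new CoordinateMatrix(entries)

    // Transpose so that rows are terms and columns are documents.
    val termsByDocs: RowMatrix = docsByTerms.transpose().toRowMatrix()

    // Column similarities are now similarities between documents.
    termsByDocs.columnSimilarities(threshold).entries.map {
      case MatrixEntry(i, j, sim) => ((i, j), sim)
    }
  }

  // Hypothetical helper: turn the pairs into one list of neighbours per document.
  // Each pair is emitted in both directions before grouping.
  def similarityLists(pairs: RDD[((Long, Long), Double)]): RDD[(Long, Iterable[(Long, Double)])] =
    pairs.flatMap { case ((i, j), sim) => Seq((i, (j, sim)), (j, (i, sim))) }.groupByKey()
}
```

columnSimilarities() returns an upper-triangular matrix, so each pair shows up only once; that is why `similarityLists` emits both directions to get lists like the ones in your example. The indices in the output are the positions assigned by zipWithIndex(), so you would need to keep a mapping from those back to your product identifiers.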

Hope that helps!

Best,
Karl

On Sat, Jan 30, 2016 at 4:30 PM Alan Prando wrote:

> Hi Folks!
>
> I am trying to implement a Spark job to calculate the similarity of my
> database products, using only their names and descriptions.
> I would like to use TF-IDF to represent my text data and cosine similarity
> to calculate all similarities.
>
> My goal is, after the job completes, to get all similarities as a list.
> For example:
> Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
> Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
> Prod3 = ((Prod1, 0.88))
> Prod4 = ((Prod1, 0.53))
>
> However, I am new to Spark and I am having trouble understanding what
> cosine similarity returns!
>
> My code:
> val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> tf.cache()
>
> val idf = new IDF(minDocFreq = 2).fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
>
> val mat = new RowMatrix(tfidf)
>
> // Compute similar columns perfectly, with brute force.
> val exact = mat.columnSimilarities()
>
> // Compute similar columns with estimation using DIMSUM
> val approx = mat.columnSimilarities(0.1)
>
> val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
> val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
>
> The file has one product's name and description in each row.
>
> The result I got:
> approxEntries.first()
> res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)
>
> How can I figure out which rows this result refers to?
>
> Thanks in advance! =]
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-01-30 Thread Alan Prando
Hi Folks!

I am trying to implement a Spark job to calculate the similarity of my database
products, using only their names and descriptions.
I would like to use TF-IDF to represent my text data and cosine similarity to
calculate all similarities.

My goal is, after the job completes, to get all similarities as a list.
For example:
Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
Prod3 = ((Prod1, 0.88))
Prod4 = ((Prod1, 0.53))

However, I am new to Spark and I am having trouble understanding what
cosine similarity returns!

My code:
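import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)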

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

val mat = new RowMatrix(tfidf)

// Compute similar columns perfectly, with brute force.
val exact = mat.columnSimilarities()

// Compute similar columns with estimation using DIMSUM
val approx = mat.columnSimilarities(0.1)

val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }

The file has one product's name and description in each row.

The result I got:
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

How can I figure out which rows this result refers to?

Thanks in advance! =]


