The example below illustrates how to use the DIMSUM algorithm to calculate
the similarity between every pair of rows and output the row pairs whose
cosine similarity is not less than a threshold.

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala


But what if I want to keep an ID for each row, so that the input file
looks like this:

id1 vector1
id2 vector2
id3 vector3
...

And the desired output is:

id1 id2 sim(id1, id2)
id1 id3 sim(id1, id3)
...
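
To make the question concrete, here is a rough, untested sketch of one
way this might be done. It is not taken from the linked example. It
assumes each input line is "<id> <v1> <v2> ..." separated by whitespace;
the object name RowSimilarityWithIds and the three command-line arguments
are hypothetical. Since columnSimilarities() on a RowMatrix compares
columns, the sketch gives every row a Long index, transposes the matrix so
that the original rows become columns, and joins the indices back to the
ids afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}

// Hypothetical driver; the name and the arguments are assumptions.
object RowSimilarityWithIds {
  def main(args: Array[String]): Unit = {
    val inputPath  = args(0)          // file with lines "<id> <v1> <v2> ..."
    val outputPath = args(1)          // directory for "idA idB similarity" lines
    val threshold  = args(2).toDouble // DIMSUM threshold, e.g. 0.1

    val sc = new SparkContext(new SparkConf().setAppName("RowSimilarityWithIds"))

    // Parse (id, vector) pairs and attach a Long index to each row.
    val indexed = sc.textFile(inputPath)
      .filter(_.trim.nonEmpty)
      .map { line =>
        val parts = line.trim.split("\\s+")
        (parts.head, Vectors.dense(parts.tail.map(_.toDouble)))
      }
      .zipWithIndex()                 // ((id, vector), index)
      .cache()

    // Lookup from row index back to the original id.
    val idByIndex = indexed.map { case ((id, _), idx) => (idx, id) }

    // Transpose so that each original row becomes a column, then let
    // columnSimilarities() compare those columns.
    val rows = indexed.map { case ((_, vec), idx) => IndexedRow(idx, vec) }
    val sims = new IndexedRowMatrix(rows)
      .toCoordinateMatrix()
      .transpose()
      .toRowMatrix()
      .columnSimilarities(threshold)

    // Entries are (i, j, value) with i < j; map both indices back to ids.
    val result = sims.entries
      .filter(_.value >= threshold)   // keep only pairs at or above the threshold
      .map { case MatrixEntry(i, j, v) => (i, (j, v)) }
      .join(idByIndex)
      .map { case (_, ((j, v), idI)) => (j, (idI, v)) }
      .join(idByIndex)
      .map { case (_, ((idI, v), idJ)) => s"$idI $idJ $v" }

    result.saveAsTextFile(outputPath)
    sc.stop()
  }
}
```

One caveat with this sketch: after the transpose, the matrix has one
column per original row, so it may not scale well when the input has a
very large number of rows.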


Alcaid
