The example below illustrates how to use the DIMSUM algorithm to calculate the similarity between every pair of rows and output the row pairs whose cosine similarity is not less than a threshold.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

But what if I want to keep an ID for each row? That is, the input file is:

id1 vector1
id2 vector2
id3 vector3
...

And we hope to output:

id1 id2 sim(id1, id2)
id1 id3 sim(id1, id3)
...

Alcaid
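
One possible way to carry the IDs through is sketched below; it is an untested outline under several assumptions, not the approach of the linked example. Note that `RowMatrix.columnSimilarities` compares columns, so the sketch transposes the data by building a `CoordinateMatrix` in which every input row becomes a column, then joins the resulting column indices back to the IDs. Assumptions: each input line has the form `id v1 v2 ...` separated by whitespace, the object name `CosineSimilarityWithIds`, the helper `parseLine`, and the three command-line arguments are hypothetical, and the number of input rows is small enough for DIMSUM to handle that many columns.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object CosineSimilarityWithIds {

  // Hypothetical helper: parse "id v1 v2 ..." into (id, values).
  // Lines without at least one value are skipped.
  def parseLine(line: String): Option[(String, Array[Double])] = {
    val parts = line.trim.split("\\s+")
    if (parts.length < 2) None else Some((parts.head, parts.tail.map(_.toDouble)))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: input path, output path, similarity threshold.
    val Array(inputPath, outputPath, thresholdStr) = args
    val threshold = thresholdStr.toDouble
    val sc = new SparkContext(new SparkConf().setAppName("CosineSimilarityWithIds"))

    // Give every input row a numeric index; it becomes a column index below.
    val indexed = sc.textFile(inputPath)
      .flatMap(line => parseLine(line))
      .zipWithIndex()
      .cache()

    // Lookup table: column index -> original ID.
    val idByIndex = indexed.map { case ((id, _), idx) => (idx, id) }

    // Transposed layout: entry (feature k, input row idx) = value.
    val entries = indexed.flatMap { case ((_, values), idx) =>
      values.zipWithIndex.collect {
        case (v, k) if v != 0.0 => MatrixEntry(k.toLong, idx, v)
      }
    }

    // DIMSUM sampling is approximate, so the threshold is also applied
    // explicitly to the returned entries.
    val sims = new CoordinateMatrix(entries)
      .toRowMatrix()
      .columnSimilarities(threshold)
      .entries
      .filter(_.value >= threshold)

    // Replace both indices of each pair with the original IDs.
    val named = sims
      .map(e => (e.i, (e.j, e.value)))
      .join(idByIndex)
      .map { case (_, ((j, sim), idI)) => (j, (idI, sim)) }
      .join(idByIndex)
      .map { case (_, ((idI, sim), idJ)) => s"$idI $idJ $sim" }

    named.saveAsTextFile(outputPath)
    sc.stop()
  }
}
```

Each pair is written once, in the form `id1 id2 sim(id1, id2)`. If the number of rows is very large, the transposed matrix has that many columns, and whether `columnSimilarities` remains practical at that size would need to be checked on the actual data.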