Right now DIMSUM is meant to be used for tall and skinny matrices, so
columnSimilarities() returns similarities between columns, not rows. We are
working on adding an efficient row similarity as well, tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
Reza
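
Until row similarity is available, one possible workaround is to build the
matrix already transposed, so that each original row becomes a column, and
then map the column indices returned by columnSimilarities() back to the row
ids. The sketch below is only an illustration under assumptions, not something
proposed in this thread: the object name RowSimilarityWithIds, the helper
similarRows, and the (id, values) input shape are all hypothetical. DIMSUM is
designed for tall and skinny matrices, so this is likely to be practical only
when the number of rows is modest.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed before Spark 1.3
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names here are illustrative only.
object RowSimilarityWithIds {

  // Returns (idA, idB, estimated cosine similarity) for pairs at or above the threshold.
  def similarRows(input: RDD[(String, Array[Double])],
                  threshold: Double): RDD[(String, String, Double)] = {
    // Give every row a Long index and keep an index -> id lookup.
    val indexed = input.zipWithIndex().cache()
    val idByIndex = indexed.map { case ((id, _), rowIdx) => (rowIdx, id) }

    // Emit entries with swapped coordinates (feature index, row index),
    // i.e. build the transposed matrix directly and skip zeros.
    val entries = indexed.flatMap { case ((_, values), rowIdx) =>
      values.zipWithIndex.collect {
        case (v, colIdx) if v != 0.0 => MatrixEntry(colIdx.toLong, rowIdx, v)
      }
    }

    // Columns of the transposed matrix are the original rows.
    val sims = new CoordinateMatrix(entries)
      .toRowMatrix()
      .columnSimilarities(threshold)

    // Translate both column indices back to the original ids.
    sims.entries
      .filter(_.value >= threshold)
      .map(e => (e.i, (e.j, e.value)))
      .join(idByIndex)
      .map { case (_, ((j, sim), idI)) => (j, (idI, sim)) }
      .join(idByIndex)
      .map { case (_, ((idI, sim), idJ)) => (idI, idJ, sim) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RowSimilarityWithIds").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample data standing in for the "id vector" input file.
    val input = sc.parallelize(Seq(
      ("id1", Array(1.0, 0.0, 2.0)),
      ("id2", Array(2.0, 0.0, 4.0)),
      ("id3", Array(0.0, 3.0, 0.0))
    ))

    similarRows(input, 0.5).collect().foreach {
      case (a, b, sim) => println(s"$a $b $sim")
    }
    sc.stop()
  }
}
```

Each pair is reported once, because columnSimilarities() returns only the
upper triangle of the similarity matrix. With a positive threshold the values
are DIMSUM estimates rather than exact cosine similarities.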

On Mon, Apr 6, 2015 at 6:08 AM, James <alcaid1...@gmail.com> wrote:

> The example below illustrates how to use the DIMSUM algorithm to calculate
> the similarity between each pair of rows and output the row pairs whose
> cosine similarity is not less than a threshold.
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
>
>
> But what if I want to keep an ID for each row, so that the input file
> is:
>
> id1 vector1
> id2 vector2
> id3 vector3
> ...
>
> And we want to output:
>
> id1 id2 sim(id1, id2)
> id1 id3 sim(id1, id3)
> ...
>
>
> Alcaid
>
