Re: Using DIMSUM with ids

Reza Zadeh Mon, 06 Apr 2015 10:16:16 -0700

Right now DIMSUM is meant to be used for tall and skinny matrices, so
columnSimilarities() returns similar columns, not rows. We are working on
adding an efficient row-similarity computation as well, tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
Reza
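
For readers who want the id-labelled output asked about in the quoted
message below, here is a minimal sketch of the bookkeeping involved. It is
plain Python and computes exact cosine similarity by brute force over all
pairs; it is not DIMSUM and does not use Spark. It assumes each input line
is an id followed by whitespace-separated numbers (the thread does not
specify the vector format), and the names parse_line, cosine, and
similar_id_pairs are hypothetical. The idea is to compute index-based
entries (i, j, sim) with i < j, the same shape of result that
columnSimilarities() produces for column indices, and then map the indices
back to ids through a lookup list.

```python
import math
from itertools import combinations


def parse_line(line):
    """Split 'id v1 v2 ...' into (id, [floats]). The format is an assumption."""
    row_id, *values = line.split()
    return row_id, [float(v) for v in values]


def cosine(u, v):
    """Exact cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_id_pairs(lines, threshold):
    """Return (id_a, id_b, sim) for every pair with sim >= threshold."""
    rows = [parse_line(line) for line in lines if line.strip()]
    ids = [row_id for row_id, _ in rows]        # index -> id lookup
    vectors = [vector for _, vector in rows]

    # Index-based entries (i, j, sim) with i < j, one per unordered pair.
    entries = [
        (i, j, cosine(vectors[i], vectors[j]))
        for i, j in combinations(range(len(rows)), 2)
    ]

    # Map indices back to ids and keep only pairs at or above the threshold.
    return [(ids[i], ids[j], sim) for i, j, sim in entries if sim >= threshold]


if __name__ == "__main__":
    sample = ["id1 1 0", "id2 1 0", "id3 0 1"]
    for id_a, id_b, sim in similar_id_pairs(sample, threshold=0.5):
        print(id_a, id_b, sim)
```

This compares every pair, so it only suits small inputs. The same
index-to-id lookup would apply to a Spark job in which each id is laid out
as a column, but that layout only fits the tall-and-skinny case when the
number of ids is modest.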


On Mon, Apr 6, 2015 at 6:08 AM, James <alcaid1...@gmail.com> wrote:

> The example below illustrates how to use the DIMSUM algorithm to calculate
> the similarity between each pair of rows and output the row pairs whose
> cosine similarity is not less than a threshold.
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
>
>
> But what if I want to keep an ID for each row? That is, the input file
> is:
>
> id1 vector1
> id2 vector2
> id3 vector3
> ...
>
> And we hope to output:
>
> id1 id2 sim(id1, id2)
> id1 id3 sim(id1, id3)
> ...
>
>
> Alcaid
>
