Re: What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

nguyen duc tuan Tue, 17 May 2016 17:20:46 -0700

There's no *RowSimilarity* method in the RowMatrix class. You have to transpose
your matrix and call columnSimilarities on the result instead. However, when the
number of rows is large, this approach is still very slow.
Try approximate nearest neighbor (ANN) methods such as LSH (locality-sensitive
hashing) instead. There are several implementations of LSH on Spark that you can
find on GitHub, for example: https://github.com/karlhigley/spark-neighbors
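
To make the LSH idea concrete, here is a minimal single-machine sketch of random-hyperplane (sign) hashing for cosine similarity, in plain Python. It is not the spark-neighbors API; every function name here (make_hyperplanes, signature, build_buckets, nearest_in_bucket) is made up for illustration. Rows that land in the same bucket are the only ones compared, which is roughly the "blocking" asked about in the original question.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes
    )


def cosine(a, b):
    """Exact cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_buckets(rows, planes):
    """Group row indices by signature; each bucket acts as one block."""
    buckets = {}
    for idx, row in enumerate(rows):
        buckets.setdefault(signature(row, planes), []).append(idx)
    return buckets


def nearest_in_bucket(i, rows, planes, buckets):
    """Exact cosine search restricted to the bucket of row i.

    Returns the index of the best candidate, or None if row i is alone
    in its bucket (a real setup would use several hash tables to reduce
    such misses).
    """
    candidates = [j for j in buckets[signature(rows[i], planes)] if j != i]
    if not candidates:
        return None
    return max(candidates, key=lambda j: cosine(rows[i], rows[j]))
```

Typical use would be: planes = make_hyperplanes(300, 16), then buckets = build_buckets(rows, planes), then nearest_in_bucket(i, rows, planes, buckets) for each row i. More bits give smaller buckets and faster search but more missed neighbors; the 16 here is an arbitrary choice, not a recommendation.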


Another option is to use an ANN library on a single machine. There's a
good benchmark of ANN libraries here:
https://github.com/erikbern/ann-benchmarks

2016-05-17 23:24 GMT+07:00 Rex X <dnsr...@gmail.com>:

> Each row of the given matrix is Vector[Double]. Want to find out the
> nearest neighbor row to each row using cosine similarity.
>
> The problem here is the complexity: O( 10^20 )
>
> We need to do *blocking*, and do the row-wise comparison within each
> block. Any tips for best practice?
>
> In Spark, we have RowMatrix.*ColumnSimilarity*, but I didn't find
> *RowSimilarity* method.
>
>
> Thank you.
>
>
> Regards
> Rex
>
>
>
>
