Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on this
paper:
http://www.cs.
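For a feel of how LSH cuts the candidate set, here is a toy single-machine sketch of random-hyperplane hashing for cosine similarity (the function and parameter names are my own, not the lsh-spark API): vectors with the same bit signature fall into one bucket, and candidate pairs are generated only within buckets.

```python
import random
from collections import defaultdict

def lsh_buckets(vectors, n_planes=8, seed=42):
    """Hash each vector to a bit signature using random hyperplanes;
    vectors with equal signatures share a bucket, so only intra-bucket
    pairs need an exact similarity check."""
    rng = random.Random(seed)
    d = len(vectors[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, v in enumerate(vectors):
        # one bit per hyperplane: which side of the plane the vector falls on
        sig = tuple(1 if sum(p[j] * v[j] for j in range(d)) >= 0.0 else 0
                    for p in planes)
        buckets[sig].append(idx)
    return buckets
```

Vectors pointing the same way always collide, near-opposite ones almost never do, so the comparison count drops from O(N^2) toward the sum of squared bucket sizes.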
Thank you all for these links. I'll check them.
On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack wrote:
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / min-hashing.
There's also an algorithm implemented in MLlib called DIMSUM which was
developed at Twitter for this purpose. I've been meaning to try it and would be
interested to hear about the results you get.
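For intuition about DIMSUM (in MLlib it is exposed as `RowMatrix.columnSimilarities(threshold)`), here is a toy single-machine sketch of its sampling idea — the names and simplifications below are mine, not MLlib's: each row emits a contribution for a column pair only with a probability controlled by a parameter gamma, and the rescaling keeps the estimate of the column cosine similarities unbiased (with gamma large enough, every pair is kept and the result is exact).

```python
import math
import random
from collections import defaultdict

def dimsum(rows, gamma, seed=0):
    """Toy DIMSUM: estimate all-pairs column cosine similarities by
    sampling co-occurring entries row by row; large gamma -> exact."""
    rng = random.Random(seed)
    d = len(rows[0])
    # column norms, needed for both the sampling probability and scaling
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(d)]
    sg = math.sqrt(gamma)
    sims = defaultdict(float)
    for r in rows:
        for i in range(d):
            if r[i] == 0:
                continue
            for j in range(i + 1, d):
                if r[j] == 0:
                    continue
                # keep this co-occurrence with probability ~ sqrt(gamma)/norm,
                # so heavy columns are down-sampled
                p = min(1.0, sg / norms[i]) * min(1.0, sg / norms[j])
                if rng.random() < p:
                    sims[(i, j)] += (r[i] * r[j]) / (
                        min(sg, norms[i]) * min(sg, norms[j]))
    return dict(sims)
```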
7, which should
become available in the MEAP mid-September.
http://www.manning.com/books/spark-graphx-in-action
From: Kristina Rogale Plazonic
To: Jaonary Rabarisoa
Cc: user
Sent: Wednesday, August 26, 2015 7:24 AM
Subject: Re: Build k-NN graph for large dataset
If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first, for example LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:
http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashi
You could also try dimensionality reduction (PCA or SVD) first. I would
imagine that even if you could successfully compute similarities in the
high-dimensional space, you would probably run into the curse of
dimensionality.
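As a minimal sketch of the dimensionality-reduction route, here is power iteration for the top principal component in plain Python (illustrative only, with made-up names — on a dataset of this size you would use MLlib's distributed PCA/SVD):

```python
import math

def top_component(points, iters=100):
    """Power iteration on the covariance X^T X of mean-centered data;
    returns the unit-norm top principal direction."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d
    for _ in range(iters):
        w = [0.0] * d
        for x in centered:
            s = sum(x[j] * v[j] for j in range(d))  # project row onto v
            for j in range(d):
                w[j] += s * x[j]                    # accumulate X^T (X v)
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v
```

Projecting onto the first few such components before the neighbor search both shrinks d and mitigates the curse of dimensionality.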
Dear all,
I'm trying to find an efficient way to build a k-NN graph for a large
dataset. Precisely, I have a large set of high-dimensional vectors (say
d >> 1) and I want to build a graph where those high-dimensional points
are the vertices and each one is linked to the k-nearest neighbor base
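For reference, the brute-force baseline that the suggestions in this thread try to avoid — an O(N^2 d) k-NN graph in plain Python (a sketch with my own names, not a recommendation at this scale):

```python
import heapq
import math

def knn_graph(points, k):
    """Adjacency list: each point index -> indices of its k nearest
    neighbours by Euclidean distance, via exhaustive pairwise comparison."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    graph = {}
    for i, p in enumerate(points):
        # scan every other point; this N^2 loop is what LSH/blocking avoids
        nearest = heapq.nsmallest(
            k, ((dist(p, q), j) for j, q in enumerate(points) if j != i))
        graph[i] = [j for _, j in nearest]
    return graph
```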