Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on this
paper:
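For a rough idea of the bucketing step, here is a minimal sketch of random-hyperplane (sign) LSH on an RDD of points. This is a generic illustration, not the lsh-spark API; it assumes a spark-shell session (sc in scope), and the toy data, signature length, and seed are arbitrary choices.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import scala.util.Random

    // Toy stand-in for the large set of high-dimensional points: (id, vector).
    val d = 10
    val points = sc.parallelize((0L until 1000L).map { i =>
      (i, Vectors.dense(Array.fill(d)(Random.nextDouble())))
    })

    // One random hyperplane per signature bit. Vectors that fall on the same
    // side of every hyperplane get the same signature, i.e. the same bucket.
    val numBits = 16
    val rnd = new Random(42)
    val hyperplanes = Array.fill(numBits, d)(rnd.nextGaussian())

    def signature(v: Vector): String = {
      val x = v.toArray
      hyperplanes.map { h =>
        if ((h, x).zipped.map(_ * _).sum >= 0) '1' else '0'
      }.mkString
    }

    // Bucket signature -> points in that bucket; only points sharing a
    // bucket need to be compared, which is what cuts down the pair count.
    val buckets = points.map { case (id, v) => (signature(v), (id, v)) }

Similar vectors collide with high probability, so nearby points tend to share a bucket while distant ones rarely do.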
Thank you all for these links. I'll check them.
On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack charles.t.h...@gmail.com
wrote:
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / min hashing.
There's also an algorithm implemented in MLlib called DIMSUM, which was
developed at Twitter for this purpose.
Dear all,
I'm trying to find an efficient way to build a k-NN graph for a large
dataset. Precisely, I have a large set of high-dimensional vectors (say d
>> 1) and I want to build a graph where those high-dimensional points
are the vertices and each one is linked to its k nearest neighbors based
on some similarity measure defined on these vectors.
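For concreteness, the brute-force baseline that the replies are trying to avoid might look like the sketch below, reusing the toy (id, vector) points RDD from the LSH sketch above; cosine similarity is just one possible choice of measure.

    import org.apache.spark.mllib.linalg.Vector

    // Illustrative cosine similarity between two dense vectors.
    def cosine(a: Vector, b: Vector): Double = {
      val (x, y) = (a.toArray, b.toArray)
      val dot = (x, y).zipped.map(_ * _).sum
      dot / (math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum))
    }

    val k = 5
    // All-pairs comparison: O(N^2) similarity computations, which is exactly
    // what becomes infeasible for a large dataset.
    val knnGraph = points.cartesian(points)
      .filter { case ((i, _), (j, _)) => i != j }
      .map { case ((i, a), (j, b)) => (i, (j, cosine(a, b))) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._2).take(k)) // k most similar per vertex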
You could try dimensionality reduction (PCA or SVD) first. I would imagine that
even if you could successfully compute similarities in the high-dimensional
space, you would still run into the curse of dimensionality.
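As a sketch of that suggestion, using MLlib's RowMatrix on the toy points RDD from above (the target dimension 3 is an arbitrary choice):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Project the points onto their top principal components before any
    // neighbour search; similarities are then computed in the reduced space.
    val mat = new RowMatrix(points.values)
    val pc = mat.computePrincipalComponents(3) // d x 3 projection matrix
    val reduced = mat.multiply(pc)             // same rows, now 3-dimensional

Note that RowMatrix does not carry the point ids, so in practice you would track them separately (or use an IndexedRowMatrix) to map the reduced rows back to graph vertices.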
On 26 Aug 2015, at 12:35, Jaonary Rabarisoa jaon...@gmail.com wrote:
Subject: Build k-NN graph for large dataset
If you don't want to compute all N^2 similarities, you need to implement some
kind of blocking first. For example, LSH (locality-sensitive hashing). A quick
search gave this link to a Spark implementation:
http://stackoverflow.com/questions/2771/spark
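Continuing the earlier LSH sketch, the blocking step then amounts to comparing vectors only within a bucket (buckets and cosine are the illustrative definitions from above):

    // Candidate pairs come only from within a bucket, not from all N^2 pairs.
    val candidatePairs = buckets.groupByKey().flatMap { case (_, members) =>
      val m = members.toArray
      for {
        i <- m.indices.iterator
        j <- (i + 1) until m.length
      } yield ((m(i)._1, m(j)._1), cosine(m(i)._2, m(j)._2))
    }

To get an actual k-NN graph from this you would still group the candidate pairs by point and keep the top k, and in practice use several hash tables so that true neighbours split across buckets are not lost.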
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / min hashing.
There's also an algorithm implemented in MLlib called DIMSUM, which was
developed at Twitter for this purpose. I've been meaning to try it and would be
interested to hear about the results you get.
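For reference, DIMSUM is exposed in MLlib as RowMatrix.columnSimilarities. One caveat: it computes similarities between columns, so to compare points you have to lay them out as columns, i.e. the transpose of the points-as-rows layout used above. A sketch with the same toy points and an arbitrary 0.1 threshold:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // Lay the points out as *columns*: entry (coordinate, pointId) holds the
    // point's value at that coordinate.
    val entries = points.flatMap { case (id, v) =>
      v.toArray.zipWithIndex.map { case (value, coord) =>
        MatrixEntry(coord.toLong, id, value)
      }
    }
    val mat = new CoordinateMatrix(entries).toRowMatrix()

    // With a positive threshold DIMSUM samples pairs, skipping those whose
    // cosine similarity is unlikely to exceed it.
    val sims = mat.columnSimilarities(0.1)
    sims.entries.take(10).foreach(println)

Each entry is (pointI, pointJ, approximate cosine similarity); turning that into a k-NN graph again means grouping by point and keeping the top k.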