Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Olivier Grisel Fri, 14 Mar 2014 04:51:05 -0700

2014-03-14 11:23 GMT+01:00 Gael Varoquaux <[email protected]>:
> On Fri, Mar 14, 2014 at 10:38:17AM +0100, Olivier Grisel wrote:
>> The problem is that by having had a look at the ANN literature in the
>> past, my gut feeling is that basic LSH is mostly worthless for
>> practical applications. I would rather not have a method is
>> scikit-learn that is not performing good enough (in terms of speed up
>> vs exact brute force queries) to be useful on real data.
>
> Quite clearly. We are not interested in such contribution. Indeed, for my
> applications, something like random projections would work much better
> than LSH, as far as I have seen.
>
> So what are exactly the usecases and the usage patterns that this
> proposal is addressing and for which we should see a big speedup?


It is my understanding that the canonical application of Approx
Nearest Neighbors methods is for multi-media similarity search with
dense, medium dimensional descriptor vectors. For instance find the
top 100 similar images in a database of at least several millions
images summarized by feature vectors ranging from 100s to 1000s of
descriptors based on possibly pooled BoW encodings of histogram based
local descriptors.

Spotify is using ANN queries for a very different use case: they do
matrix factorization of User-Item interaction matrix and perform
similarity queries between items in the latent space (I would say in
the order of 100 components).

See the readme for slightly more details: https://github.com/spotify/annoy

That sounds like a very useful application. We could try to build a
benchmark dataset by extracting the top 100 SVD components of the
movielens 10M matrix:

http://grouplens.org/datasets/movielens/

but that represents only 72K users and 10K items so I am not sure that
ANN queries would yield a speed up compared to brute force on such
data.

Another user case would be to implement k-Nearest Neighbors
classification on datasets with a dimension high enough to render
exact methods such as KD-tree and ball-tree inefficient (I would say
500+ features).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to