2014-03-14 11:23 GMT+01:00 Gael Varoquaux <[email protected]>: > On Fri, Mar 14, 2014 at 10:38:17AM +0100, Olivier Grisel wrote: >> The problem is that by having had a look at the ANN literature in the >> past, my gut feeling is that basic LSH is mostly worthless for >> practical applications. I would rather not have a method is >> scikit-learn that is not performing good enough (in terms of speed up >> vs exact brute force queries) to be useful on real data. > > Quite clearly. We are not interested in such contribution. Indeed, for my > applications, something like random projections would work much better > than LSH, as far as I have seen. > > So what are exactly the usecases and the usage patterns that this > proposal is addressing and for which we should see a big speedup?
It is my understanding that the canonical application of Approx Nearest Neighbors methods is for multi-media similarity search with dense, medium dimensional descriptor vectors. For instance find the top 100 similar images in a database of at least several millions images summarized by feature vectors ranging from 100s to 1000s of descriptors based on possibly pooled BoW encodings of histogram based local descriptors. Spotify is using ANN queries for a very different use case: they do matrix factorization of User-Item interaction matrix and perform similarity queries between items in the latent space (I would say in the order of 100 components). See the readme for slightly more details: https://github.com/spotify/annoy That sounds like a very useful application. We could try to build a benchmark dataset by extracting the top 100 SVD components of the movielens 10M matrix: http://grouplens.org/datasets/movielens/ but that represents only 72K users and 10K items so I am not sure that ANN queries would yield a speed up compared to brute force on such data. Another user case would be to implement k-Nearest Neighbors classification on datasets with a dimension high enough to render exact methods such as KD-tree and ball-tree inefficient (I would say 500+ features). -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel ------------------------------------------------------------------------------ Learn Graph Databases - Download FREE O'Reilly Book "Graph Databases" is the definitive new guide to graph databases and their applications. Written by three acclaimed leaders in the field, this first edition is now available. Download your free book today! http://p.sf.net/sfu/13534_NeoTech _______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
