Hi Olivier,

Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe) 
paper? I've used it in practice, and it makes a significant improvement 
vs. basic LSH. I've proposed in recent emails to the list how to 
implement it based on sorting and binary search (in order to avoid the 
dependency/maintenance burden of range-queries).

I think evaluating and tweaking vanilla LSH is a waste of time, when an 
obvious improvement that fixes a significant real world flaw (control of 
number of candidates seen) in the method is available.

Daniel

On 03/14/2014 11:38 AM, Olivier Grisel wrote:
> 2014-03-13 23:15 GMT+01:00 Robert Layton <[email protected]>:
>> Thanks Gael. My thinking was to implement "Basic LSH with basic data
>> structures" and then spend some of the time working on seeing if moderate
>> improvements (i.e. a more complex data structure) can deliver benefits. This
>> way, we get the key deliverable, and spend some time trying to see if we can
>> do better.
>>
>> I'd also like to see scalability added to the evaluation criteria!
>
> The problem is that by having had a look at the ANN literature in the
> past, my gut feeling is that basic LSH is mostly worthless for
> practical applications. I would rather not have a method is
> scikit-learn that is not performing good enough (in terms of speed up
> vs exact brute force queries) to be useful on real data.
>
> To decide I think the best way to proceed would be to have a
> evaluation / prototyping stage in the project that would first hack an
> implementation in of the basic methods (at least random projections
> and maybe others) in a gist outside of the scikit-learn codebase, not
> to worry about documentation and compliance with the API and benchmark
> it / profile it on representative datasets with various statistical
> structures and compare the outcome to alternative methods such as the
> methods implemented in FLANN and Annoy.
>
> Actually there exists several Python implementation of vanilla LSH:
>
> https://pypi.python.org/pypi/lshash
> http://nearpy.io/
> https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py
>
> It would be interesting to benchmark them all on the same datasets
> (with various dimensionality, ranks and sparsity patterns).
>
> Note, that Radim already had a look at the yahoo LSH implementation
> and found it impractical to use:
>
> http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
>


------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to