My 0.02 cents; not a scikit learn contributor but have implemented this 
before.

First, its 138 citations on an important but simple datastructure/query 
optimization: replace hashed maps with ordered maps for the range 
queries, and use those to get the right number of candidates, instead of 
too many or too few depending on local density. The problem of variable 
density that it solves was pretty serious in my application, and seems 
pretty generic.

The main risk is in trying too hard to optimize the ordered map for 
efficiency and compactness. A Judy array would probably be ideal, but 
its C/C++ code with a patent in the background, and another B-Tree might 
be a hassle as a dependency or maintenance.

Fortunately, sorted arrays and binary search are already very efficient 
ordered maps (as long as batched construction is acceptable), and 
something more specialized can be swapped in later if justified. This 
avoids the main risk completely for this GSOC at low cost.

Other than that, the construction code is the same as in LSH, and the 
query code is novel but well localized. Maybe test code for some of the 
other NN search methods can be reused.

Daniel Vainsencher

On 03/11/2014 04:17 AM, Gael Varoquaux wrote:
> On Sun, Mar 09, 2014 at 08:46:32PM +1100, Robert Layton wrote:
>> That said, you'll need to make the case to the community -- the Forest
>> LSH is a variation of a key algorithm. LSH by itself is definitely
>> worthy of inclusion in scikit-learn, and it looks like Forest LSH would
>> be sufficiently well-used (Google tells me 138 citations), but I think
>> we need the community's decision.
>
> 138 citations on Google scholar isn't that much given the current
> standards of inclusion in scikit-learn. I have probably missed the
> discussion, but can someone make a compelling case for Forest LSH in a
> few lines and evaluate the cost to benefit ratio of implementing it
> (factoring in risk analysis and maintenance cost).
>
> Cheers,
>
> Gaƫl
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to