My 0.02 cents; not a scikit learn contributor but have implemented this before.
First, its 138 citations on an important but simple datastructure/query optimization: replace hashed maps with ordered maps for the range queries, and use those to get the right number of candidates, instead of too many or too few depending on local density. The problem of variable density that it solves was pretty serious in my application, and seems pretty generic. The main risk is in trying too hard to optimize the ordered map for efficiency and compactness. A Judy array would probably be ideal, but its C/C++ code with a patent in the background, and another B-Tree might be a hassle as a dependency or maintenance. Fortunately, sorted arrays and binary search are already very efficient ordered maps (as long as batched construction is acceptable), and something more specialized can be swapped in later if justified. This avoids the main risk completely for this GSOC at low cost. Other than that, the construction code is the same as in LSH, and the query code is novel but well localized. Maybe test code for some of the other NN search methods can be reused. Daniel Vainsencher On 03/11/2014 04:17 AM, Gael Varoquaux wrote: > On Sun, Mar 09, 2014 at 08:46:32PM +1100, Robert Layton wrote: >> That said, you'll need to make the case to the community -- the Forest >> LSH is a variation of a key algorithm. LSH by itself is definitely >> worthy of inclusion in scikit-learn, and it looks like Forest LSH would >> be sufficiently well-used (Google tells me 138 citations), but I think >> we need the community's decision. > > 138 citations on Google scholar isn't that much given the current > standards of inclusion in scikit-learn. I have probably missed the > discussion, but can someone make a compelling case for Forest LSH in a > few lines and evaluate the cost to benefit ratio of implementing it > (factoring in risk analysis and maintenance cost). > > Cheers, > > Gaƫl > > ------------------------------------------------------------------------------ > Learn Graph Databases - Download FREE O'Reilly Book > "Graph Databases" is the definitive new guide to graph databases and their > applications. Written by three acclaimed leaders in the field, > this first edition is now available. Download your free book today! > http://p.sf.net/sfu/13534_NeoTech > _______________________________________________ > Scikit-learn-general mailing list > [email protected] > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general > ------------------------------------------------------------------------------ Learn Graph Databases - Download FREE O'Reilly Book "Graph Databases" is the definitive new guide to graph databases and their applications. Written by three acclaimed leaders in the field, this first edition is now available. Download your free book today! http://p.sf.net/sfu/13534_NeoTech _______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
