Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Daniel Vainsencher Wed, 12 Mar 2014 08:00:07 -0700

Yes, at least the parts of ordered maps critical to implementing and 
testing LSH-Forest. Removal would be expensive, growing on line also.


But the usual usage of:
1. Index a million points
2. Run 10 million ANN queries

would be served very well by numpy.arrays (with dtype=int). Sorting and 
binary search there are probably already optimized to death, I wouldn't 
spend time on it. Then you can implement and test LSH-Forest itself 
relatively quickly, leaving time to solve problems like "how many 
candidates should be considered" etc.

Daniel

On 03/12/2014 03:52 PM, Maheshakya Wijewardena wrote:
> Hi Daniel,
>
> You are suggesting that current numpy arrays and implementation of a
> binary search can be used to create the functionality of an ordered map
> successfully, right? With this structure, trees for LSH forest can be
> built. Or we can go for a Cython implementation as well.
>
> Gaël, What is the status of the evaluation of the cost to benefit ratio
> for this? I would appreciate it if we can decide on storing structure
> for the LSH implementation quickly as I have to make a sound plan for
> this project.
>
> Best Regards,
> Maheshakya.
>
>
> On Tue, Mar 11, 2014 at 1:47 PM, Daniel Vainsencher
> <[email protected] <mailto:[email protected]>> wrote:
>
>     My 0.02 cents; not a scikit learn contributor but have implemented this
>     before.
>
>     First, its 138 citations on an important but simple datastructure/query
>     optimization: replace hashed maps with ordered maps for the range
>     queries, and use those to get the right number of candidates, instead of
>     too many or too few depending on local density. The problem of variable
>     density that it solves was pretty serious in my application, and seems
>     pretty generic.
>
>     The main risk is in trying too hard to optimize the ordered map for
>     efficiency and compactness. A Judy array would probably be ideal, but
>     its C/C++ code with a patent in the background, and another B-Tree might
>     be a hassle as a dependency or maintenance.
>
>     Fortunately, sorted arrays and binary search are already very efficient
>     ordered maps (as long as batched construction is acceptable), and
>     something more specialized can be swapped in later if justified. This
>     avoids the main risk completely for this GSOC at low cost.
>
>     Other than that, the construction code is the same as in LSH, and the
>     query code is novel but well localized. Maybe test code for some of the
>     other NN search methods can be reused.
>
>     Daniel Vainsencher
>
>     On 03/11/2014 04:17 AM, Gael Varoquaux wrote:
>      > On Sun, Mar 09, 2014 at 08:46:32PM +1100, Robert Layton wrote:
>      >> That said, you'll need to make the case to the community -- the
>     Forest
>      >> LSH is a variation of a key algorithm. LSH by itself is definitely
>      >> worthy of inclusion in scikit-learn, and it looks like Forest
>     LSH would
>      >> be sufficiently well-used (Google tells me 138 citations), but I
>     think
>      >> we need the community's decision.
>      >
>      > 138 citations on Google scholar isn't that much given the current
>      > standards of inclusion in scikit-learn. I have probably missed the
>      > discussion, but can someone make a compelling case for Forest LSH
>     in a
>      > few lines and evaluate the cost to benefit ratio of implementing it
>      > (factoring in risk analysis and maintenance cost).
>      >
>      > Cheers,
>      >
>      > Gaël
>      >
>      >
>     
> ------------------------------------------------------------------------------
>      > Learn Graph Databases - Download FREE O'Reilly Book
>      > "Graph Databases" is the definitive new guide to graph databases
>     and their
>      > applications. Written by three acclaimed leaders in the field,
>      > this first edition is now available. Download your free book today!
>      > http://p.sf.net/sfu/13534_NeoTech
>      > _______________________________________________
>      > Scikit-learn-general mailing list
>      > [email protected]
>     <mailto:[email protected]>
>      > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>      >
>
>
>     
> ------------------------------------------------------------------------------
>     Learn Graph Databases - Download FREE O'Reilly Book
>     "Graph Databases" is the definitive new guide to graph databases and
>     their
>     applications. Written by three acclaimed leaders in the field,
>     this first edition is now available. Download your free book today!
>     http://p.sf.net/sfu/13534_NeoTech
>     _______________________________________________
>     Scikit-learn-general mailing list
>     [email protected]
>     <mailto:[email protected]>
>     https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to