Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Daniel Vainsencher Tue, 18 Mar 2014 13:18:28 -0700

Hi Maheshakya,

I've read your proposal. Remember that evaluating doesn't bring home the 
bacon: you want an effective ANN solution that scikit learn can actually 
integrate. By all accounts, plain LSH is not competitive. Then you 
should evaluate it only to the extent it requires almost no extra 
effort. Since (to my understanding) scikit-learn prefers not to take on 
any dependencies, and prefers not to maintain complex tree based data 
structures, you should be wary where you spend implementation time. Do 
you have a practical plan for implementing any promising variant other 
than LSH-Forest?


Concretely, I would implement LSHForest first; create a variant query 
method that uses constant k (thus implementing plain LSH). Connect to 1 
or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?), 
and evaluate variations comparing to them. An implementation in 
scikit-learn that is in the ballpark of any of those would be an 
excellent achievement in my opinion for a GSOC. Is there benchmarking 
code that you can adopt into a scikit learn demo/benchmark?

Remember to allocate time to tweaking the query code: there are a couple 
of different algorithms in the LSH F paper, and many variants can be 
invented.

Daniel

On 03/18/2014 04:47 PM, Maheshakya Wijewardena wrote:
> Thank you Robert for your valuable feedback. I will update my proposal
> accordingly asap and let you know.
>
> However, I prepared this according to the template given by Python
> organization. So I have put information about the projects I have done
> in that position according to template. If scikit-learn expects it to be
> otherwise, I can change it.
>
> I have another thing to discuss here before finalizing my proposal.
> Mr. Wei Dong has informed me that there is another state-of-art approach
> he implemented to perform ANN search and asked to see if that can be
> used in the evaluating stage. This implementation is called KGraph
> <http://www.kgraph.org/index.php?n=Main.Python>. It has comparisons with
> FLANN, I have started reading it. Maybe you can have a look into that as
> well.
>
> Best regards,
> Maheshakya
>
>
> On Tue, Mar 18, 2014 at 4:08 AM, Robert Layton <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Thanks for sending this.
>
>     The grammar needs updating, but I'll let you worry about
>     proofreading. Comments below:
>     - State some /evidence/ for your experience with python, i.e.
>     projects you worked on or links to big commits (this is covered late
>     in the application, move some information forward).
>     - Don't choose a project because "it was on the list" -- why is the
>     problem important and what can *you* do to help?
>     - Be more confident! "This is the most vital part of this project as
>     I think." Ensure the whole project is planned and evaluated.
>     - Can you talk a bit more about which hashing algorithms will be
>     implemented? This is a major part of the work, but isn't
>     appropriately covered.
>     - In your abstract, I would make certain your plan -- you want to
>     implement prototypes, evaluate them and then finish the
>     implementations. Make this absolutely clear (currently, the reader
>     has to piece this together for themselves)
>     - References need to be done more thoroughly.
>
>     I think it's a good project, now you need to sell it!
>
>
>
>
>     On 18 March 2014 06:55, Maheshakya Wijewardena
>     <[email protected] <mailto:[email protected]>> wrote:
>
>         Hi,
>
>         I have written a proposal for this project and it is in the wiki,
>         
> https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
>
>         It is not complete yet. I have not written the timeline yet. But
>         I have described my plan for this project. I think the
>         prototyping part of this project should be started during the
>         community bonding period, while fixing other issues because that
>         needs to be done as soon as possible. I will update this with my
>         timeline soon.
>
>         Can you review this and give me your feedback.
>
>         Thank you,
>         Maheshakya.
>
>
>         On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher
>         <[email protected]
>         <mailto:[email protected]>> wrote:
>
>             On 03/16/2014 08:59 PM, Olivier Grisel wrote:
>              > 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher
>             <[email protected]
>             <mailto:[email protected]>>:
>              >> Hi Olivier,
>              >>
>              >> Have you looked at the LSH-Forest (and to a lesser
>             extent, Multiprobe)
>              >> paper? I've used it in practice, and it makes a
>             significant improvement
>              >> vs. basic LSH. I've proposed in recent emails to the
>             list how to
>              >> implement it based on sorting and binary search (in
>             order to avoid the
>              >> dependency/maintenance burden of range-queries).
>              >
>              > No I did not read those papers (yet).
>             I don't know the culture here, but since you seemed to
>             summarize rather
>             than wait for someone to read up on LSH:
>             - Where LSH uses hash-maps keyed on the a random hash of
>             length exactly
>             k, LSHF uses an ordered map (binary tree or BTree in the
>             paper). This
>             enables queries of form:
>             "all keys whose prefix of length h<=k matches this hash".
>             This trivially
>             enables a per tree query of type "give me the best q hashes
>             (and maybe a
>             few more)" by doing queries with h=k, h=k-1,... until
>             satisfied (this
>             can be optimized in different ways).
>             - Then queries across multiple trees can be combined in
>             different ways,
>             the simplest being "give me at least q/L candidates from
>             each tree, take
>             the union". More complex variants don't seem to make much
>             difference.
>             - So, the parameter k becomes essentially per-query, L can
>             be determined
>             on its own, and we can determine the number of expensive
>             "real distance"
>             comparisons we want to pay per query.
>
>             I offer the simple observation that sorted arrays support
>             very fast
>             range queries, so that complex/expensive trees are not
>             necessary for an
>             LSHF implementation.
>
>             As Maheshakya said it might very
>              > well be the method implemented in Annoy.
>             It seems similar but distinct, see a recent mail.
>
>              > Do you have sample python
>              > code around?
>             No, I implemented it in C#, and source is not available.
>
>              >> I think evaluating and tweaking vanilla LSH is a waste
>             of time, when an
>              >> obvious improvement that fixes a significant real world
>             flaw (control of
>              >> number of candidates seen) in the method is available.
>              > If the obvious improvement does not come with a
>             significant increment
>              > in code complexity (and maintenance burden) then I agree,
>             yes.
>             So, was the above was sufficient to answer your prerequisite?
>
>              >Getting rid of hyperparameters to select manually of via
>             grid-search
>              > is a big usability improvement and can make the method
>             much more
>              > practical to use.
>             In this case, performance becomes better than mere tuning
>             would allow,
>             since it adapts to database density around the query point.
>
>             Daniel
>
>             
> ------------------------------------------------------------------------------
>             Learn Graph Databases - Download FREE O'Reilly Book
>             "Graph Databases" is the definitive new guide to graph
>             databases and their
>             applications. Written by three acclaimed leaders in the field,
>             this first edition is now available. Download your free book
>             today!
>             http://p.sf.net/sfu/13534_NeoTech
>             _______________________________________________
>             Scikit-learn-general mailing list
>             [email protected]
>             <mailto:[email protected]>
>             https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>
>         --
>         Undergraduate,
>         Department of Computer Science and Engineering,
>         Faculty of Engineering.
>         University of Moratuwa,
>         Sri Lanka
>
>         
> ------------------------------------------------------------------------------
>         Learn Graph Databases - Download FREE O'Reilly Book
>         "Graph Databases" is the definitive new guide to graph databases
>         and their
>         applications. Written by three acclaimed leaders in the field,
>         this first edition is now available. Download your free book today!
>         http://p.sf.net/sfu/13534_NeoTech
>         _______________________________________________
>         Scikit-learn-general mailing list
>         [email protected]
>         <mailto:[email protected]>
>         https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>     
> ------------------------------------------------------------------------------
>     Learn Graph Databases - Download FREE O'Reilly Book
>     "Graph Databases" is the definitive new guide to graph databases and
>     their
>     applications. Written by three acclaimed leaders in the field,
>     this first edition is now available. Download your free book today!
>     http://p.sf.net/sfu/13534_NeoTech
>     _______________________________________________
>     Scikit-learn-general mailing list
>     [email protected]
>     <mailto:[email protected]>
>     https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to