Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Maheshakya Wijewardena Mon, 17 Mar 2014 12:57:07 -0700

Hi,

I have written a proposal for this project and it is in the wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search


It is not complete yet. I have not written the timeline yet. But I have
described my plan for this project. I think the prototyping part of this
project should be started during the community bonding period, while fixing
other issues because that needs to be done as soon as possible. I will
update this with my timeline soon.

Can you review this and give me your feedback.

Thank you,
Maheshakya.


On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher <
[email protected]> wrote:

> On 03/16/2014 08:59 PM, Olivier Grisel wrote:
> > 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <
> [email protected]>:
> >> Hi Olivier,
> >>
> >> Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
> >> paper? I've used it in practice, and it makes a significant improvement
> >> vs. basic LSH. I've proposed in recent emails to the list how to
> >> implement it based on sorting and binary search (in order to avoid the
> >> dependency/maintenance burden of range-queries).
> >
> > No I did not read those papers (yet).
> I don't know the culture here, but since you seemed to summarize rather
> than wait for someone to read up on LSH:
> - Where LSH uses hash-maps keyed on the a random hash of length exactly
> k, LSHF uses an ordered map (binary tree or BTree in the paper). This
> enables queries of form:
> "all keys whose prefix of length h<=k matches this hash". This trivially
> enables a per tree query of type "give me the best q hashes (and maybe a
> few more)" by doing queries with h=k, h=k-1,... until satisfied (this
> can be optimized in different ways).
> - Then queries across multiple trees can be combined in different ways,
> the simplest being "give me at least q/L candidates from each tree, take
> the union". More complex variants don't seem to make much difference.
> - So, the parameter k becomes essentially per-query, L can be determined
> on its own, and we can determine the number of expensive "real distance"
> comparisons we want to pay per query.
>
> I offer the simple observation that sorted arrays support very fast
> range queries, so that complex/expensive trees are not necessary for an
> LSHF implementation.
>
> As Maheshakya said it might very
> > well be the method implemented in Annoy.
> It seems similar but distinct, see a recent mail.
>
> > Do you have sample python
> > code around?
> No, I implemented it in C#, and source is not available.
>
> >> I think evaluating and tweaking vanilla LSH is a waste of time, when an
> >> obvious improvement that fixes a significant real world flaw (control of
> >> number of candidates seen) in the method is available.
> > If the obvious improvement does not come with a significant increment
> > in code complexity (and maintenance burden) then I agree, yes.
> So, was the above was sufficient to answer your prerequisite?
>
> >Getting rid of hyperparameters to select manually of via grid-search
> > is a big usability improvement and can make the method much more
> > practical to use.
> In this case, performance becomes better than mere tuning would allow,
> since it adapts to database density around the query point.
>
> Daniel
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>



-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to