Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Maheshakya Wijewardena Tue, 08 Apr 2014 03:38:13 -0700

Hi all,
I have updated the timeline in my proposal. Can you review it and give me
your feedback before I finalize it.


Thank you,
Maheshakya.


On Wed, Mar 19, 2014 at 9:42 PM, Maheshakya Wijewardena <
[email protected]> wrote:

> Daniel,
>
> Moreover, those libraries(FLANN, ANNOY and Kgraph) are not implemented in
> pure Python though they have Python wrappers. Their complexity is also
> high(except for ANNOY). So integrating my prototypes into those will not be
> feasible. I think the best is to have quick simple implementation of
> LSH-Forest only to test on the selected datasets.
>
>
> On Wed, Mar 19, 2014 at 8:36 PM, Maheshakya Wijewardena <
> [email protected]> wrote:
>
>> Hi Robert,
>>
>> I see what you are saying about the template, but you can also add extra
>>> information up the front!
>>
>>
>> I have made an indication of what I have done related to machine learning
>> previously in the my personal statement. But the links and description my
>> work remain at the same position.
>>
>>  I've read your proposal. Remember that evaluating doesn't bring home the
>>> bacon: you want an effective ANN solution that scikit learn can actually
>>> integrate. By all accounts, plain LSH is not competitive. Then you
>>> should evaluate it only to the extent it requires almost no extra
>>> effort. Since (to my understanding) scikit-learn prefers not to take on
>>> any dependencies, and prefers not to maintain complex tree based data
>>> structures, you should be wary where you spend implementation time. Do
>>> you have a practical plan for implementing any promising variant other
>>> than LSH-Forest?
>>> Concretely, I would implement LSHForest first; create a variant query
>>> method that uses constant k (thus implementing plain LSH). Connect to 1
>>> or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
>>> and evaluate variations comparing to them. An implementation in
>>> scikit-learn that is in the ballpark of any of those would be an
>>> excellent achievement in my opinion for a GSOC. Is there benchmarking
>>> code that you can adopt into a scikit learn demo/benchmark?
>>
>>
>> Daniel, as you have mentioned I wont put much effort on implementing the
>> vanilla LSH algorithm. I will just use existing Python implementations for
>> evaluation. Moreover, My priority is to prototype LSH-Forest and analyze it
>> (This is indicated in the updated proposal). But I will allocate sometime
>> to ANNOY as well.I am currently preparing my timeline for the project.
>>
>> I think we do not need to connect LSH-Forest prototype to those
>> libraries. We just need a quick hack for the data structure and get it
>> tested. But the result will have to be compared with them.
>>
>> Currently, I do not have a bench mark code. But I will design one in the
>> way that Radim has used in his evaluation.
>>
>> Best Regards,
>> Maheshakya
>>
>>
>> On Wed, Mar 19, 2014 at 3:48 AM, Robert Layton <[email protected]>wrote:
>>
>>> I see what you are saying about the template, but you can also add extra
>>> information up the front!
>>> My reasoning behind that comment -- I read your background and, at that
>>> point, didn't understand why you were submitting for a python project
>>> specifically.
>>>
>>> Afterwards, it is clear, but a rule of thumb for grant applications:
>>> don't make the reviewers guess, and don't surprise them!
>>>
>>>
>>> On 19 March 2014 07:17, Daniel Vainsencher <[email protected]
>>> > wrote:
>>>
>>>> Hi Maheshakya,
>>>>
>>>> I've read your proposal. Remember that evaluating doesn't bring home the
>>>> bacon: you want an effective ANN solution that scikit learn can actually
>>>> integrate. By all accounts, plain LSH is not competitive. Then you
>>>> should evaluate it only to the extent it requires almost no extra
>>>> effort. Since (to my understanding) scikit-learn prefers not to take on
>>>> any dependencies, and prefers not to maintain complex tree based data
>>>> structures, you should be wary where you spend implementation time. Do
>>>> you have a practical plan for implementing any promising variant other
>>>> than LSH-Forest?
>>>>
>>>> Concretely, I would implement LSHForest first; create a variant query
>>>> method that uses constant k (thus implementing plain LSH). Connect to 1
>>>> or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
>>>> and evaluate variations comparing to them. An implementation in
>>>> scikit-learn that is in the ballpark of any of those would be an
>>>> excellent achievement in my opinion for a GSOC. Is there benchmarking
>>>> code that you can adopt into a scikit learn demo/benchmark?
>>>>
>>>> Remember to allocate time to tweaking the query code: there are a couple
>>>> of different algorithms in the LSH F paper, and many variants can be
>>>> invented.
>>>>
>>>> Daniel
>>>>
>>>> On 03/18/2014 04:47 PM, Maheshakya Wijewardena wrote:
>>>> > Thank you Robert for your valuable feedback. I will update my proposal
>>>> > accordingly asap and let you know.
>>>> >
>>>> > However, I prepared this according to the template given by Python
>>>> > organization. So I have put information about the projects I have done
>>>> > in that position according to template. If scikit-learn expects it to
>>>> be
>>>> > otherwise, I can change it.
>>>> >
>>>> > I have another thing to discuss here before finalizing my proposal.
>>>> > Mr. Wei Dong has informed me that there is another state-of-art
>>>> approach
>>>> > he implemented to perform ANN search and asked to see if that can be
>>>> > used in the evaluating stage. This implementation is called KGraph
>>>> > <http://www.kgraph.org/index.php?n=Main.Python>. It has comparisons
>>>> with
>>>> > FLANN, I have started reading it. Maybe you can have a look into that
>>>> as
>>>> > well.
>>>> >
>>>> > Best regards,
>>>> > Maheshakya
>>>> >
>>>> >
>>>> > On Tue, Mar 18, 2014 at 4:08 AM, Robert Layton <
>>>> [email protected]
>>>> > <mailto:[email protected]>> wrote:
>>>> >
>>>> >     Thanks for sending this.
>>>> >
>>>> >     The grammar needs updating, but I'll let you worry about
>>>> >     proofreading. Comments below:
>>>> >     - State some /evidence/ for your experience with python, i.e.
>>>> >     projects you worked on or links to big commits (this is covered
>>>> late
>>>> >     in the application, move some information forward).
>>>> >     - Don't choose a project because "it was on the list" -- why is
>>>> the
>>>> >     problem important and what can *you* do to help?
>>>> >     - Be more confident! "This is the most vital part of this project
>>>> as
>>>> >     I think." Ensure the whole project is planned and evaluated.
>>>> >     - Can you talk a bit more about which hashing algorithms will be
>>>> >     implemented? This is a major part of the work, but isn't
>>>> >     appropriately covered.
>>>> >     - In your abstract, I would make certain your plan -- you want to
>>>> >     implement prototypes, evaluate them and then finish the
>>>> >     implementations. Make this absolutely clear (currently, the reader
>>>> >     has to piece this together for themselves)
>>>> >     - References need to be done more thoroughly.
>>>> >
>>>> >     I think it's a good project, now you need to sell it!
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >     On 18 March 2014 06:55, Maheshakya Wijewardena
>>>> >     <[email protected] <mailto:[email protected]>> wrote:
>>>> >
>>>> >         Hi,
>>>> >
>>>> >         I have written a proposal for this project and it is in the
>>>> wiki,
>>>> >
>>>> https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
>>>> >
>>>> >         It is not complete yet. I have not written the timeline yet.
>>>> But
>>>> >         I have described my plan for this project. I think the
>>>> >         prototyping part of this project should be started during the
>>>> >         community bonding period, while fixing other issues because
>>>> that
>>>> >         needs to be done as soon as possible. I will update this with
>>>> my
>>>> >         timeline soon.
>>>> >
>>>> >         Can you review this and give me your feedback.
>>>> >
>>>> >         Thank you,
>>>> >         Maheshakya.
>>>> >
>>>> >
>>>> >         On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher
>>>> >         <[email protected]
>>>> >         <mailto:[email protected]>> wrote:
>>>> >
>>>> >             On 03/16/2014 08:59 PM, Olivier Grisel wrote:
>>>> >              > 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher
>>>> >             <[email protected]
>>>> >             <mailto:[email protected]>>:
>>>> >              >> Hi Olivier,
>>>> >              >>
>>>> >              >> Have you looked at the LSH-Forest (and to a lesser
>>>> >             extent, Multiprobe)
>>>> >              >> paper? I've used it in practice, and it makes a
>>>> >             significant improvement
>>>> >              >> vs. basic LSH. I've proposed in recent emails to the
>>>> >             list how to
>>>> >              >> implement it based on sorting and binary search (in
>>>> >             order to avoid the
>>>> >              >> dependency/maintenance burden of range-queries).
>>>> >              >
>>>> >              > No I did not read those papers (yet).
>>>> >             I don't know the culture here, but since you seemed to
>>>> >             summarize rather
>>>> >             than wait for someone to read up on LSH:
>>>> >             - Where LSH uses hash-maps keyed on the a random hash of
>>>> >             length exactly
>>>> >             k, LSHF uses an ordered map (binary tree or BTree in the
>>>> >             paper). This
>>>> >             enables queries of form:
>>>> >             "all keys whose prefix of length h<=k matches this hash".
>>>> >             This trivially
>>>> >             enables a per tree query of type "give me the best q
>>>> hashes
>>>> >             (and maybe a
>>>> >             few more)" by doing queries with h=k, h=k-1,... until
>>>> >             satisfied (this
>>>> >             can be optimized in different ways).
>>>> >             - Then queries across multiple trees can be combined in
>>>> >             different ways,
>>>> >             the simplest being "give me at least q/L candidates from
>>>> >             each tree, take
>>>> >             the union". More complex variants don't seem to make much
>>>> >             difference.
>>>> >             - So, the parameter k becomes essentially per-query, L can
>>>> >             be determined
>>>> >             on its own, and we can determine the number of expensive
>>>> >             "real distance"
>>>> >             comparisons we want to pay per query.
>>>> >
>>>> >             I offer the simple observation that sorted arrays support
>>>> >             very fast
>>>> >             range queries, so that complex/expensive trees are not
>>>> >             necessary for an
>>>> >             LSHF implementation.
>>>> >
>>>> >             As Maheshakya said it might very
>>>> >              > well be the method implemented in Annoy.
>>>> >             It seems similar but distinct, see a recent mail.
>>>> >
>>>> >              > Do you have sample python
>>>> >              > code around?
>>>> >             No, I implemented it in C#, and source is not available.
>>>> >
>>>> >              >> I think evaluating and tweaking vanilla LSH is a waste
>>>> >             of time, when an
>>>> >              >> obvious improvement that fixes a significant real
>>>> world
>>>> >             flaw (control of
>>>> >              >> number of candidates seen) in the method is available.
>>>> >              > If the obvious improvement does not come with a
>>>> >             significant increment
>>>> >              > in code complexity (and maintenance burden) then I
>>>> agree,
>>>> >             yes.
>>>> >             So, was the above was sufficient to answer your
>>>> prerequisite?
>>>> >
>>>> >              >Getting rid of hyperparameters to select manually of via
>>>> >             grid-search
>>>> >              > is a big usability improvement and can make the method
>>>> >             much more
>>>> >              > practical to use.
>>>> >             In this case, performance becomes better than mere tuning
>>>> >             would allow,
>>>> >             since it adapts to database density around the query
>>>> point.
>>>> >
>>>> >             Daniel
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> >             Learn Graph Databases - Download FREE O'Reilly Book
>>>> >             "Graph Databases" is the definitive new guide to graph
>>>> >             databases and their
>>>> >             applications. Written by three acclaimed leaders in the
>>>> field,
>>>> >             this first edition is now available. Download your free
>>>> book
>>>> >             today!
>>>> >             http://p.sf.net/sfu/13534_NeoTech
>>>> >             _______________________________________________
>>>> >             Scikit-learn-general mailing list
>>>> >             [email protected]
>>>> >             <mailto:[email protected]>
>>>> >
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >         --
>>>> >         Undergraduate,
>>>> >         Department of Computer Science and Engineering,
>>>> >         Faculty of Engineering.
>>>> >         University of Moratuwa,
>>>> >         Sri Lanka
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> >         Learn Graph Databases - Download FREE O'Reilly Book
>>>> >         "Graph Databases" is the definitive new guide to graph
>>>> databases
>>>> >         and their
>>>> >         applications. Written by three acclaimed leaders in the field,
>>>> >         this first edition is now available. Download your free book
>>>> today!
>>>> >         http://p.sf.net/sfu/13534_NeoTech
>>>> >         _______________________________________________
>>>> >         Scikit-learn-general mailing list
>>>> >         [email protected]
>>>> >         <mailto:[email protected]>
>>>> >
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> >     Learn Graph Databases - Download FREE O'Reilly Book
>>>> >     "Graph Databases" is the definitive new guide to graph databases
>>>> and
>>>> >     their
>>>> >     applications. Written by three acclaimed leaders in the field,
>>>> >     this first edition is now available. Download your free book
>>>> today!
>>>> >     http://p.sf.net/sfu/13534_NeoTech
>>>> >     _______________________________________________
>>>> >     Scikit-learn-general mailing list
>>>> >     [email protected]
>>>> >     <mailto:[email protected]>
>>>> >     https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Undergraduate,
>>>> > Department of Computer Science and Engineering,
>>>> > Faculty of Engineering.
>>>> > University of Moratuwa,
>>>> > Sri Lanka
>>>> >
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>> > "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> > applications. Written by three acclaimed leaders in the field,
>>>> > this first edition is now available. Download your free book today!
>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Scikit-learn-general mailing list
>>>> > [email protected]
>>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> applications. Written by three acclaimed leaders in the field,
>>>> this first edition is now available. Download your free book today!
>>>> http://p.sf.net/sfu/13534_NeoTech
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> applications. Written by three acclaimed leaders in the field,
>>> this first edition is now available. Download your free book today!
>>> http://p.sf.net/sfu/13534_NeoTech
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>> --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>



-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka

------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment 
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to