Hi all,
I have updated the timeline in my proposal. Can you review it and give me
your feedback before I finalize it.
Thank you,
Maheshakya.
On Wed, Mar 19, 2014 at 9:42 PM, Maheshakya Wijewardena <
[email protected]> wrote:
> Daniel,
>
> Moreover, those libraries(FLANN, ANNOY and Kgraph) are not implemented in
> pure Python though they have Python wrappers. Their complexity is also
> high(except for ANNOY). So integrating my prototypes into those will not be
> feasible. I think the best is to have quick simple implementation of
> LSH-Forest only to test on the selected datasets.
>
>
> On Wed, Mar 19, 2014 at 8:36 PM, Maheshakya Wijewardena <
> [email protected]> wrote:
>
>> Hi Robert,
>>
>> I see what you are saying about the template, but you can also add extra
>>> information up the front!
>>
>>
>> I have made an indication of what I have done related to machine learning
>> previously in the my personal statement. But the links and description my
>> work remain at the same position.
>>
>> I've read your proposal. Remember that evaluating doesn't bring home the
>>> bacon: you want an effective ANN solution that scikit learn can actually
>>> integrate. By all accounts, plain LSH is not competitive. Then you
>>> should evaluate it only to the extent it requires almost no extra
>>> effort. Since (to my understanding) scikit-learn prefers not to take on
>>> any dependencies, and prefers not to maintain complex tree based data
>>> structures, you should be wary where you spend implementation time. Do
>>> you have a practical plan for implementing any promising variant other
>>> than LSH-Forest?
>>> Concretely, I would implement LSHForest first; create a variant query
>>> method that uses constant k (thus implementing plain LSH). Connect to 1
>>> or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
>>> and evaluate variations comparing to them. An implementation in
>>> scikit-learn that is in the ballpark of any of those would be an
>>> excellent achievement in my opinion for a GSOC. Is there benchmarking
>>> code that you can adopt into a scikit learn demo/benchmark?
>>
>>
>> Daniel, as you have mentioned I wont put much effort on implementing the
>> vanilla LSH algorithm. I will just use existing Python implementations for
>> evaluation. Moreover, My priority is to prototype LSH-Forest and analyze it
>> (This is indicated in the updated proposal). But I will allocate sometime
>> to ANNOY as well.I am currently preparing my timeline for the project.
>>
>> I think we do not need to connect LSH-Forest prototype to those
>> libraries. We just need a quick hack for the data structure and get it
>> tested. But the result will have to be compared with them.
>>
>> Currently, I do not have a bench mark code. But I will design one in the
>> way that Radim has used in his evaluation.
>>
>> Best Regards,
>> Maheshakya
>>
>>
>> On Wed, Mar 19, 2014 at 3:48 AM, Robert Layton <[email protected]>wrote:
>>
>>> I see what you are saying about the template, but you can also add extra
>>> information up the front!
>>> My reasoning behind that comment -- I read your background and, at that
>>> point, didn't understand why you were submitting for a python project
>>> specifically.
>>>
>>> Afterwards, it is clear, but a rule of thumb for grant applications:
>>> don't make the reviewers guess, and don't surprise them!
>>>
>>>
>>> On 19 March 2014 07:17, Daniel Vainsencher <[email protected]
>>> > wrote:
>>>
>>>> Hi Maheshakya,
>>>>
>>>> I've read your proposal. Remember that evaluating doesn't bring home the
>>>> bacon: you want an effective ANN solution that scikit learn can actually
>>>> integrate. By all accounts, plain LSH is not competitive. Then you
>>>> should evaluate it only to the extent it requires almost no extra
>>>> effort. Since (to my understanding) scikit-learn prefers not to take on
>>>> any dependencies, and prefers not to maintain complex tree based data
>>>> structures, you should be wary where you spend implementation time. Do
>>>> you have a practical plan for implementing any promising variant other
>>>> than LSH-Forest?
>>>>
>>>> Concretely, I would implement LSHForest first; create a variant query
>>>> method that uses constant k (thus implementing plain LSH). Connect to 1
>>>> or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
>>>> and evaluate variations comparing to them. An implementation in
>>>> scikit-learn that is in the ballpark of any of those would be an
>>>> excellent achievement in my opinion for a GSOC. Is there benchmarking
>>>> code that you can adopt into a scikit learn demo/benchmark?
>>>>
>>>> Remember to allocate time to tweaking the query code: there are a couple
>>>> of different algorithms in the LSH F paper, and many variants can be
>>>> invented.
>>>>
>>>> Daniel
>>>>
>>>> On 03/18/2014 04:47 PM, Maheshakya Wijewardena wrote:
>>>> > Thank you Robert for your valuable feedback. I will update my proposal
>>>> > accordingly asap and let you know.
>>>> >
>>>> > However, I prepared this according to the template given by Python
>>>> > organization. So I have put information about the projects I have done
>>>> > in that position according to template. If scikit-learn expects it to
>>>> be
>>>> > otherwise, I can change it.
>>>> >
>>>> > I have another thing to discuss here before finalizing my proposal.
>>>> > Mr. Wei Dong has informed me that there is another state-of-art
>>>> approach
>>>> > he implemented to perform ANN search and asked to see if that can be
>>>> > used in the evaluating stage. This implementation is called KGraph
>>>> > <http://www.kgraph.org/index.php?n=Main.Python>. It has comparisons
>>>> with
>>>> > FLANN, I have started reading it. Maybe you can have a look into that
>>>> as
>>>> > well.
>>>> >
>>>> > Best regards,
>>>> > Maheshakya
>>>> >
>>>> >
>>>> > On Tue, Mar 18, 2014 at 4:08 AM, Robert Layton <
>>>> [email protected]
>>>> > <mailto:[email protected]>> wrote:
>>>> >
>>>> > Thanks for sending this.
>>>> >
>>>> > The grammar needs updating, but I'll let you worry about
>>>> > proofreading. Comments below:
>>>> > - State some /evidence/ for your experience with python, i.e.
>>>> > projects you worked on or links to big commits (this is covered
>>>> late
>>>> > in the application, move some information forward).
>>>> > - Don't choose a project because "it was on the list" -- why is
>>>> the
>>>> > problem important and what can *you* do to help?
>>>> > - Be more confident! "This is the most vital part of this project
>>>> as
>>>> > I think." Ensure the whole project is planned and evaluated.
>>>> > - Can you talk a bit more about which hashing algorithms will be
>>>> > implemented? This is a major part of the work, but isn't
>>>> > appropriately covered.
>>>> > - In your abstract, I would make certain your plan -- you want to
>>>> > implement prototypes, evaluate them and then finish the
>>>> > implementations. Make this absolutely clear (currently, the reader
>>>> > has to piece this together for themselves)
>>>> > - References need to be done more thoroughly.
>>>> >
>>>> > I think it's a good project, now you need to sell it!
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On 18 March 2014 06:55, Maheshakya Wijewardena
>>>> > <[email protected] <mailto:[email protected]>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I have written a proposal for this project and it is in the
>>>> wiki,
>>>> >
>>>> https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
>>>> >
>>>> > It is not complete yet. I have not written the timeline yet.
>>>> But
>>>> > I have described my plan for this project. I think the
>>>> > prototyping part of this project should be started during the
>>>> > community bonding period, while fixing other issues because
>>>> that
>>>> > needs to be done as soon as possible. I will update this with
>>>> my
>>>> > timeline soon.
>>>> >
>>>> > Can you review this and give me your feedback.
>>>> >
>>>> > Thank you,
>>>> > Maheshakya.
>>>> >
>>>> >
>>>> > On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher
>>>> > <[email protected]
>>>> > <mailto:[email protected]>> wrote:
>>>> >
>>>> > On 03/16/2014 08:59 PM, Olivier Grisel wrote:
>>>> > > 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher
>>>> > <[email protected]
>>>> > <mailto:[email protected]>>:
>>>> > >> Hi Olivier,
>>>> > >>
>>>> > >> Have you looked at the LSH-Forest (and to a lesser
>>>> > extent, Multiprobe)
>>>> > >> paper? I've used it in practice, and it makes a
>>>> > significant improvement
>>>> > >> vs. basic LSH. I've proposed in recent emails to the
>>>> > list how to
>>>> > >> implement it based on sorting and binary search (in
>>>> > order to avoid the
>>>> > >> dependency/maintenance burden of range-queries).
>>>> > >
>>>> > > No I did not read those papers (yet).
>>>> > I don't know the culture here, but since you seemed to
>>>> > summarize rather
>>>> > than wait for someone to read up on LSH:
>>>> > - Where LSH uses hash-maps keyed on the a random hash of
>>>> > length exactly
>>>> > k, LSHF uses an ordered map (binary tree or BTree in the
>>>> > paper). This
>>>> > enables queries of form:
>>>> > "all keys whose prefix of length h<=k matches this hash".
>>>> > This trivially
>>>> > enables a per tree query of type "give me the best q
>>>> hashes
>>>> > (and maybe a
>>>> > few more)" by doing queries with h=k, h=k-1,... until
>>>> > satisfied (this
>>>> > can be optimized in different ways).
>>>> > - Then queries across multiple trees can be combined in
>>>> > different ways,
>>>> > the simplest being "give me at least q/L candidates from
>>>> > each tree, take
>>>> > the union". More complex variants don't seem to make much
>>>> > difference.
>>>> > - So, the parameter k becomes essentially per-query, L can
>>>> > be determined
>>>> > on its own, and we can determine the number of expensive
>>>> > "real distance"
>>>> > comparisons we want to pay per query.
>>>> >
>>>> > I offer the simple observation that sorted arrays support
>>>> > very fast
>>>> > range queries, so that complex/expensive trees are not
>>>> > necessary for an
>>>> > LSHF implementation.
>>>> >
>>>> > As Maheshakya said it might very
>>>> > > well be the method implemented in Annoy.
>>>> > It seems similar but distinct, see a recent mail.
>>>> >
>>>> > > Do you have sample python
>>>> > > code around?
>>>> > No, I implemented it in C#, and source is not available.
>>>> >
>>>> > >> I think evaluating and tweaking vanilla LSH is a waste
>>>> > of time, when an
>>>> > >> obvious improvement that fixes a significant real
>>>> world
>>>> > flaw (control of
>>>> > >> number of candidates seen) in the method is available.
>>>> > > If the obvious improvement does not come with a
>>>> > significant increment
>>>> > > in code complexity (and maintenance burden) then I
>>>> agree,
>>>> > yes.
>>>> > So, was the above was sufficient to answer your
>>>> prerequisite?
>>>> >
>>>> > >Getting rid of hyperparameters to select manually of via
>>>> > grid-search
>>>> > > is a big usability improvement and can make the method
>>>> > much more
>>>> > > practical to use.
>>>> > In this case, performance becomes better than mere tuning
>>>> > would allow,
>>>> > since it adapts to database density around the query
>>>> point.
>>>> >
>>>> > Daniel
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>> > "Graph Databases" is the definitive new guide to graph
>>>> > databases and their
>>>> > applications. Written by three acclaimed leaders in the
>>>> field,
>>>> > this first edition is now available. Download your free
>>>> book
>>>> > today!
>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>> > _______________________________________________
>>>> > Scikit-learn-general mailing list
>>>> > [email protected]
>>>> > <mailto:[email protected]>
>>>> >
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Undergraduate,
>>>> > Department of Computer Science and Engineering,
>>>> > Faculty of Engineering.
>>>> > University of Moratuwa,
>>>> > Sri Lanka
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>> > "Graph Databases" is the definitive new guide to graph
>>>> databases
>>>> > and their
>>>> > applications. Written by three acclaimed leaders in the field,
>>>> > this first edition is now available. Download your free book
>>>> today!
>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>> > _______________________________________________
>>>> > Scikit-learn-general mailing list
>>>> > [email protected]
>>>> > <mailto:[email protected]>
>>>> >
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>> > "Graph Databases" is the definitive new guide to graph databases
>>>> and
>>>> > their
>>>> > applications. Written by three acclaimed leaders in the field,
>>>> > this first edition is now available. Download your free book
>>>> today!
>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>> > _______________________________________________
>>>> > Scikit-learn-general mailing list
>>>> > [email protected]
>>>> > <mailto:[email protected]>
>>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Undergraduate,
>>>> > Department of Computer Science and Engineering,
>>>> > Faculty of Engineering.
>>>> > University of Moratuwa,
>>>> > Sri Lanka
>>>> >
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>> > "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> > applications. Written by three acclaimed leaders in the field,
>>>> > this first edition is now available. Download your free book today!
>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Scikit-learn-general mailing list
>>>> > [email protected]
>>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> >
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> applications. Written by three acclaimed leaders in the field,
>>>> this first edition is now available. Download your free book today!
>>>> http://p.sf.net/sfu/13534_NeoTech
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> applications. Written by three acclaimed leaders in the field,
>>> this first edition is now available. Download your free book today!
>>> http://p.sf.net/sfu/13534_NeoTech
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>> --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general