I think I'd be better as a second mentor -- this would be my first GSoC of
any type.
That said, I am here to help!

*That said*, you'll need to make the case to the community -- the Forest
LSH is a variation of a key algorithm. LSH by itself is definitely worthy
of inclusion in scikit-learn, and it looks like Forest LSH would be
sufficiently well-used (Google tells me 138 citations), but I think we need
the community's decision.


On 9 March 2014 17:51, Maheshakya Wijewardena <[email protected]>wrote:

> Yes, I think Robert Layton is a possible mentor for this project. I have
> already discussed some facts about this with him in the issues in Github.
> I was thinking of getting the opinion of other contributors as well as
> they may have something to say about this or other suggestions for better
> ways of implementing.
>
>
>
> On Sun, Mar 9, 2014 at 12:12 PM, Ronnie Ghose <[email protected]>wrote:
>
>> rather than approval of the community you should ask for a mentor as I
>> assumed you read the GSOC idea list - a mentor is a heavy job relatively
>>
>>
>> On Sun, Mar 9, 2014 at 1:36 AM, Maheshakya Wijewardena <
>> [email protected]> wrote:
>>
>>> Hi Daniel,
>>>
>>> Thank you for pointing this out. I also have noticed that storing hash
>>> tables of training data and dealing with binary vectors might be
>>> problematic when using plain LSH based ANN.
>>> However, before implementing LSH forests, I need to get the approval of
>>> the community. I'm looking forward to have feedback on this.
>>>
>>> Thank you,
>>> Maheshakya.
>>>
>>>
>>> On Thu, Mar 6, 2014 at 2:34 PM, Daniel Vainsencher <
>>> [email protected]> wrote:
>>>
>>>>  Hi Maheshakya,
>>>>
>>>> In regular LSH, a particular setting of the number of hash functions
>>>> per index (k) and the number of indexes (L) essentially determines the size
>>>> of the region in space from which candidates will be chosen in response to
>>>> a query. If queries q1 and q2 are in areas where the density of data points
>>>> is very different, then one will receive many more candidates than the
>>>> other, thus be slower and more accurate. Thus tuning k, L involves a
>>>> tradeoff, and with varied density, you cannot win.
>>>>
>>>> Even worse: as you add data, your k, L stop being optimal. So tuning on
>>>> a  significantly smaller sample is not effective, which makes tuning even
>>>> more expensive.
>>>>
>>>> Going back to LSH Forest, while an implementation using binary trees is
>>>> possible and ok for testing, but not very space or cache efficient. Other
>>>> data structures (b-trees and variants, see [2] for some discussion) are
>>>> more appropriate. The key operations are range queries and depending on
>>>> data change rate, insertion.
>>>>
>>>> Space efficiency of indexes is also affected by the hash type: if you
>>>> use a binary hash functions, choose k in {32, 64} and use the appropriate
>>>> integer types to avoid numpy's awkwardness around bit vectors. LSH-Forest
>>>> has no difficulty with such high k in the index (as opposed to plain LSH).
>>>>
>>>> Daniel Vainsencher
>>>>
>>>>
>>>> On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:
>>>>
>>>> Hi Daniel,
>>>> These alternative strategies of LSH are a good way to start using the
>>>> existing implementations of Scikit-learn. But why do you say the standard
>>>> algorithms are impractical?
>>>>
>>>>  Maheshakya
>>>>
>>>>
>>>>  On Wed, Mar 5, 2014 at 1:31 AM, Daniel Vainsencher <
>>>> [email protected]> wrote:
>>>>
>>>>>  Hi scikit-learn, Maheshakya,
>>>>>
>>>>> In my experience implementing and using LSH, the vanilla algorithms
>>>>> are very impractical. Two extensions that help a lot (and combine nicely)
>>>>> are:
>>>>> - Multi-probe LSH [1], where hash values neighboring the target's hash
>>>>> are also used.
>>>>> - LSH Forest [2], where fixed length hashes are replaced by length
>>>>> bounded hashes; the hash used is just long enough to get sufficiently few
>>>>> candidates for the target's nearest neighbor.
>>>>>
>>>>> Together they make it feasible to limit the number of independent
>>>>> copies of the index to a reasonable number. Incidentally, the LSH Forest
>>>>> approach may be implemented using binary trees.
>>>>>
>>>>> Daniel Vainsencher
>>>>>
>>>>> [1] www.cs.princeton.edu/cass/papers/mp*lsh*_vldb07.pdf
>>>>> [2] www2005.org/docs/p651.pdf
>>>>>
>>>>>
>>>>> On 03/04/2014 06:09 PM, Maheshakya Wijewardena wrote:
>>>>>
>>>>>  Hi,
>>>>>
>>>>> Currently, unsupervised neighbor search is implemented in
>>>>> sklearn.neighbors. There are three algorithms used to perform neighbor
>>>>> search.
>>>>>
>>>>>    - kd tree
>>>>>    - ball tree
>>>>>    - brute force method
>>>>>
>>>>> kd tree and ball tree are implemented separately as cython
>>>>> implementations. Both of them use binary tree(binary_tree.pxi). In the
>>>>> NeighborBass class, above implementations are used.
>>>>>
>>>>>  As Locality Sensitivity Hashing will also be used in approximating
>>>>> nearest neighbors, its' implementation should also be a part of
>>>>> sklearn.neighbors. But Locality sensitivity hashing algorithms do not
>>>>> involve binary tree implementation. Therefore it has to be implemented
>>>>> separately and be used in sklearn.neighbors as another algorithm.Is this
>>>>> approach correct?
>>>>>
>>>>>  I'm trying to implement a base LSH class and the hashing algorithms
>>>>> as cython implementations. To approximate neighbor search using those
>>>>> algorithms, multiple hash tables will be created with those hash functions
>>>>> so, for that an efficient way to store is required.
>>>>>
>>>>>  These are some of the issues I found disturbing while I was trying
>>>>> to implement LSH to approximate neighbor search. Is this an appropriate
>>>>> approach to implement this or are there any other better ways?
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Maheshakya,
>>>>> --
>>>>> Undergraduate,
>>>>> Department of Computer Science and Engineering,
>>>>> Faculty of Engineering.
>>>>> University of Moratuwa,
>>>>> Sri Lanka
>>>>>
>>>>>
>>>>>  
>>>>> ------------------------------------------------------------------------------
>>>>> Subversion Kills Productivity. Get off Subversion & Make the Move to 
>>>>> Perforce.
>>>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>>>> Faster operations. Version large binaries.  Built-in WAN optimization and 
>>>>> the
>>>>> freedom to use Git, Perforce or both. Make the move to 
>>>>> Perforce.http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing 
>>>>> [email protected]https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>>>>> Perforce.
>>>>> With Perforce, you get hassle-free workflows. Merge that actually
>>>>> works.
>>>>> Faster operations. Version large binaries.  Built-in WAN optimization
>>>>> and the
>>>>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>>>>
>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>> Undergraduate,
>>>> Department of Computer Science and Engineering,
>>>> Faculty of Engineering.
>>>> University of Moratuwa,
>>>> Sri Lanka
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Subversion Kills Productivity. Get off Subversion & Make the Move to 
>>>> Perforce.
>>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>>> Faster operations. Version large binaries.  Built-in WAN optimization and 
>>>> the
>>>> freedom to use Git, Perforce or both. Make the move to 
>>>> Perforce.http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Scikit-learn-general mailing 
>>>> [email protected]https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>>>> Perforce.
>>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>>> Faster operations. Version large binaries.  Built-in WAN optimization
>>>> and the
>>>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>> --
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering.
>>> University of Moratuwa,
>>> Sri Lanka
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>>> Perforce.
>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>> Faster operations. Version large binaries.  Built-in WAN optimization
>>> and the
>>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>> Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries.  Built-in WAN optimization and
>> the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
> ------------------------------------------------------------------------------
> Subversion Kills Productivity. Get off Subversion & Make the Move to
> Perforce.
> With Perforce, you get hassle-free workflows. Merge that actually works.
> Faster operations. Version large binaries.  Built-in WAN optimization and
> the
> freedom to use Git, Perforce or both. Make the move to Perforce.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works. 
Faster operations. Version large binaries.  Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to