Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Ronnie Ghose Sat, 08 Mar 2014 22:44:20 -0800

rather than approval of the community you should ask for a mentor as I
assumed you read the GSOC idea list - a mentor is a heavy job relatively



On Sun, Mar 9, 2014 at 1:36 AM, Maheshakya Wijewardena <
[email protected]> wrote:

> Hi Daniel,
>
> Thank you for pointing this out. I also have noticed that storing hash
> tables of training data and dealing with binary vectors might be
> problematic when using plain LSH based ANN.
> However, before implementing LSH forests, I need to get the approval of
> the community. I'm looking forward to have feedback on this.
>
> Thank you,
> Maheshakya.
>
>
> On Thu, Mar 6, 2014 at 2:34 PM, Daniel Vainsencher <
> [email protected]> wrote:
>
>>  Hi Maheshakya,
>>
>> In regular LSH, a particular setting of the number of hash functions per
>> index (k) and the number of indexes (L) essentially determines the size of
>> the region in space from which candidates will be chosen in response to a
>> query. If queries q1 and q2 are in areas where the density of data points
>> is very different, then one will receive many more candidates than the
>> other, thus be slower and more accurate. Thus tuning k, L involves a
>> tradeoff, and with varied density, you cannot win.
>>
>> Even worse: as you add data, your k, L stop being optimal. So tuning on
>> a  significantly smaller sample is not effective, which makes tuning even
>> more expensive.
>>
>> Going back to LSH Forest, while an implementation using binary trees is
>> possible and ok for testing, but not very space or cache efficient. Other
>> data structures (b-trees and variants, see [2] for some discussion) are
>> more appropriate. The key operations are range queries and depending on
>> data change rate, insertion.
>>
>> Space efficiency of indexes is also affected by the hash type: if you use
>> a binary hash functions, choose k in {32, 64} and use the appropriate
>> integer types to avoid numpy's awkwardness around bit vectors. LSH-Forest
>> has no difficulty with such high k in the index (as opposed to plain LSH).
>>
>> Daniel Vainsencher
>>
>>
>> On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:
>>
>> Hi Daniel,
>> These alternative strategies of LSH are a good way to start using the
>> existing implementations of Scikit-learn. But why do you say the standard
>> algorithms are impractical?
>>
>>  Maheshakya
>>
>>
>>  On Wed, Mar 5, 2014 at 1:31 AM, Daniel Vainsencher <
>> [email protected]> wrote:
>>
>>>  Hi scikit-learn, Maheshakya,
>>>
>>> In my experience implementing and using LSH, the vanilla algorithms are
>>> very impractical. Two extensions that help a lot (and combine nicely) are:
>>> - Multi-probe LSH [1], where hash values neighboring the target's hash
>>> are also used.
>>> - LSH Forest [2], where fixed length hashes are replaced by length
>>> bounded hashes; the hash used is just long enough to get sufficiently few
>>> candidates for the target's nearest neighbor.
>>>
>>> Together they make it feasible to limit the number of independent copies
>>> of the index to a reasonable number. Incidentally, the LSH Forest approach
>>> may be implemented using binary trees.
>>>
>>> Daniel Vainsencher
>>>
>>> [1] www.cs.princeton.edu/cass/papers/mp*lsh*_vldb07.pdf
>>> [2] www2005.org/docs/p651.pdf
>>>
>>>
>>> On 03/04/2014 06:09 PM, Maheshakya Wijewardena wrote:
>>>
>>>  Hi,
>>>
>>> Currently, unsupervised neighbor search is implemented in
>>> sklearn.neighbors. There are three algorithms used to perform neighbor
>>> search.
>>>
>>>    - kd tree
>>>    - ball tree
>>>    - brute force method
>>>
>>> kd tree and ball tree are implemented separately as cython
>>> implementations. Both of them use binary tree(binary_tree.pxi). In the
>>> NeighborBass class, above implementations are used.
>>>
>>>  As Locality Sensitivity Hashing will also be used in approximating
>>> nearest neighbors, its' implementation should also be a part of
>>> sklearn.neighbors. But Locality sensitivity hashing algorithms do not
>>> involve binary tree implementation. Therefore it has to be implemented
>>> separately and be used in sklearn.neighbors as another algorithm.Is this
>>> approach correct?
>>>
>>>  I'm trying to implement a base LSH class and the hashing algorithms as
>>> cython implementations. To approximate neighbor search using those
>>> algorithms, multiple hash tables will be created with those hash functions
>>> so, for that an efficient way to store is required.
>>>
>>>  These are some of the issues I found disturbing while I was trying to
>>> implement LSH to approximate neighbor search. Is this an appropriate
>>> approach to implement this or are there any other better ways?
>>>
>>>
>>>
>>> Regards,
>>> Maheshakya,
>>> --
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering.
>>> University of Moratuwa,
>>> Sri Lanka
>>>
>>>
>>>  
>>> ------------------------------------------------------------------------------
>>> Subversion Kills Productivity. Get off Subversion & Make the Move to 
>>> Perforce.
>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>> Faster operations. Version large binaries.  Built-in WAN optimization and 
>>> the
>>> freedom to use Git, Perforce or both. Make the move to 
>>> Perforce.http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>>
>>>
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing 
>>> [email protected]https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>>> Perforce.
>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>> Faster operations. Version large binaries.  Built-in WAN optimization
>>> and the
>>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>>  --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to 
>> Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries.  Built-in WAN optimization and the
>> freedom to use Git, Perforce or both. Make the move to 
>> Perforce.http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>>
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing 
>> [email protected]https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>> Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries.  Built-in WAN optimization and
>> the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
> ------------------------------------------------------------------------------
> Subversion Kills Productivity. Get off Subversion & Make the Move to
> Perforce.
> With Perforce, you get hassle-free workflows. Merge that actually works.
> Faster operations. Version large binaries.  Built-in WAN optimization and
> the
> freedom to use Git, Perforce or both. Make the move to Perforce.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works. 
Faster operations. Version large binaries.  Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to