[ 
https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994094#comment-13994094
 ] 

Casey Stella edited comment on DATAFU-37 at 5/10/14 1:55 AM:
-------------------------------------------------------------

I added a comment about the seed being required.  Maybe it's reasonable to move 
the discussion here.  I'll copy my comment from reviewboard.  The question at 
hand is whether we should require a seed to be specified or whether we should 
just have each LSH use a random seed:

I'm concerned that with a random seed, different tasks would effectively 
create different LSH functions under the same lsh_id, because the RNG is used 
to construct the definition of the LSH function.  The whole notion of the 
lsh_id is that you create k different hashes and hash all points with each; 
then, for each hash (unique lsh_id), you search the set of values that share 
a hash value with the query point.  If we use a random seed, then for the hash 
function with lsh_id x, points hashed in task i and task j will be hashed with 
different functions, so their hash values are not comparable.
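
To illustrate the concern, here is a minimal standalone sketch.  It is not the 
code from the patch: the class and method names are made up for this example, 
and it assumes a cosine-style LSH built from random hyperplanes drawn from 
java.util.Random.  Two "tasks" that share a seed build identical functions; 
tasks that seed themselves independently do not.

```java
import java.util.Random;

// Hypothetical illustration only; names and structure are not from the patch.
final class SeededLshSketch {

    /** Draws the random hyperplanes that define one cosine-LSH function. */
    static double[][] hyperplanes(long seed, int numBits, int dim) {
        Random rng = new Random(seed);
        double[][] planes = new double[numBits][dim];
        for (int i = 0; i < numBits; i++) {
            for (int j = 0; j < dim; j++) {
                planes[i][j] = rng.nextGaussian();
            }
        }
        return planes;
    }

    /** One bit per hyperplane: set when the point lies on its non-negative side. */
    static int hash(double[][] planes, double[] point) {
        int bits = 0;
        for (int i = 0; i < planes.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < point.length; j++) {
                dot += planes[i][j] * point[j];
            }
            if (dot >= 0) {
                bits |= 1 << i;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        double[] point = {1.0, -2.0, 0.5};

        // Task i and task j share a seed, so they build the same function.
        double[][] taskI = hyperplanes(42L, 8, 3);
        double[][] taskJ = hyperplanes(42L, 8, 3);
        System.out.println("same seed, same hash: "
                + (hash(taskI, point) == hash(taskJ, point)));

        // A task that picks its own seed builds an unrelated function,
        // so its output cannot be compared with the others.
        double[][] taskK = hyperplanes(new Random().nextLong(), 8, 3);
        System.out.println("independent seed hash: " + hash(taskK, point));
    }
}
```

With a shared seed the comparison is always true; with independent seeds any 
agreement between tasks is coincidental.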

Thoughts?

EDIT: What if I generated the seed from an RNG on the frontend in 
outputSchema and passed it in via the UDFContext?  That way we have a 
consistent seed, which means consistent LSH functions across the cluster.  
Would that be suitable?
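
A rough sketch of that idea follows.  To keep it runnable on its own, a plain 
java.util.Properties object stands in for the properties that Pig's UDFContext 
would hand to the UDF; the property key, class name, and method names are all 
hypothetical, and the real wiring into outputSchema is not shown.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical illustration only; Properties stands in for the UDF's
// properties from the UDFContext, and the key name is made up.
final class FrontendSeedSketch {

    static final String SEED_KEY = "lsh.seed";

    /** Frontend side: choose a seed once and keep it if one is already stored. */
    static void storeSeedOnce(Properties props, Random rng) {
        if (props.getProperty(SEED_KEY) == null) {
            props.setProperty(SEED_KEY, Long.toString(rng.nextLong()));
        }
    }

    /** Backend side: every task reads the same stored seed. */
    static long readSeed(Properties props) {
        String value = props.getProperty(SEED_KEY);
        if (value == null) {
            throw new IllegalStateException("seed was never set on the frontend");
        }
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        Properties props = new Properties();

        // Called on the frontend, possibly more than once.
        storeSeedOnce(props, new Random());
        storeSeedOnce(props, new Random());

        // Each task would then construct its LSH functions from this seed.
        long seedSeenByTaskI = readSeed(props);
        long seedSeenByTaskJ = readSeed(props);
        System.out.println("tasks agree on seed: "
                + (seedSeenByTaskI == seedSeenByTaskJ));
    }
}
```

The check before setting the property matters because the frontend hook may 
run more than once; without it, a later call could overwrite the seed.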


was (Author: cestella):
I added a comment about the seed being required.  Maybe it's reasonable to move 
the discussion here.  I'll copy my comment from reviewboard.  The question at 
hand is whether we should require a seed to be specified or whether we should 
just have each LSH use a random seed:

I'm just concerned that if it's a random seed, then different tasks effectively 
create different LSH functions with the same lsh_id because that RNG is used in 
constructing the definition of the LSH function.  The whole notion of the 
lsh_id is that you create k different hashes and hash all points for each, then 
for each hash (unique lsh_id), search in the set of values which share the hash 
with the query point.  IF we make it a random seed, what will happen is that 
for hash_id x, points hashed in task i and task j will be hashed with different 
functions.

Thoughts?

EDIT: What if I generated the seed from a RNG on the frontend in the 
outputSchema and passed it in via the UDFContext.  That way we have a 
consistent seed, which means consistent LSH functions across the cluster.  
Would that be suitable?

> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
>                 Key: DATAFU-37
>                 URL: https://issues.apache.org/jira/browse/DATAFU-37
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>         Attachments: DATAFU-37-1.patch, DATAFU-37-2.patch, DATAFU-37.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive 
> Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing] in support 
> of finding k-near neighbors.   Initially, hashes associated with L1, L2 and 
> Cosine similarity should be supported.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
