[ https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994094#comment-13994094 ]
Casey Stella edited comment on DATAFU-37 at 5/10/14 1:55 AM:
-------------------------------------------------------------

I added a comment about the seed being required. Maybe it's reasonable to move the discussion here. I'll copy my comment from reviewboard.

The question at hand is whether we should require a seed to be specified or whether we should just have each LSH use a random seed:

I'm just concerned that if it's a random seed, then different tasks effectively create different LSH functions with the same lsh_id, because that RNG is used in constructing the definition of the LSH function. The whole notion of the lsh_id is that you create k different hashes and hash all points with each; then, for each hash (unique lsh_id), you search in the set of values which share the hash with the query point.

If we make it a random seed, what will happen is that for the hash function with lsh_id x, points hashed in task i and task j will be hashed with different functions. Thoughts?

EDIT: What if I generated the seed from an RNG on the frontend in the outputSchema and passed it in via the UDFContext? That way we have a consistent seed, which means consistent LSH functions across the cluster. Would that be suitable?

was (Author: cestella):
I added a comment about the seed being required. Maybe it's reasonable to move the discussion here. I'll copy my comment from reviewboard.

The question at hand is whether we should require a seed to be specified or whether we should just have each LSH use a random seed:

I'm just concerned that if it's a random seed, then different tasks effectively create different LSH functions with the same lsh_id because that RNG is used in constructing the definition of the LSH function. The whole notion of the lsh_id is that you create k different hashes and hash all points for each, then for each hash (unique lsh_id), search in the set of values which share the hash with the query point.

If we make it a random seed, what will happen is that for hash_id x, points hashed in task i and task j will be hashed with different functions. Thoughts?

EDIT: What if I generated the seed from a RNG on the frontend in the outputSchema and passed it in via the UDFContext. That way we have a consistent seed, which means consistent LSH functions across the cluster. Would that be suitable?

> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
>                 Key: DATAFU-37
>                 URL: https://issues.apache.org/jira/browse/DATAFU-37
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>         Attachments: DATAFU-37-1.patch, DATAFU-37-2.patch, DATAFU-37.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive
> Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing] in support
> of finding k-near neighbors. Initially, hashes associated with L1, L2 and
> Cosine similarity should be supported.

--
This message was sent by Atlassian JIRA (v6.2#6252)
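
The seed concern raised in the comment can be illustrated with a small standalone sketch. This is not the DataFu patch and does not use Pig's UDFContext; it is plain Python, assuming a sign-random-projection hash as the cosine LSH. The names `make_cosine_lsh`, `derive_seed`, and `base_seed` are hypothetical and exist only for this illustration. It shows that two "tasks" which build the hash for the same lsh_id from a shared seed get the same function, whereas unseeded RNGs would give each task its own hyperplanes.

```python
import random


def make_cosine_lsh(dim, num_bits, seed=None):
    """Build a sign-random-projection hash (one common cosine LSH).

    The random hyperplanes drawn here ARE the definition of the hash
    function, so the seed determines which function you get.
    seed=None lets Python seed the RNG from OS entropy.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_bits)]

    def lsh(vec):
        bits = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            # One bit per hyperplane: which side of it the point falls on.
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    return lsh


def derive_seed(base_seed, lsh_id):
    # Hypothetical scheme: a distinct but reproducible seed per lsh_id.
    return base_seed * 1_000_003 + lsh_id


dim, num_bits = 4, 16
data_rng = random.Random(7)
points = [[data_rng.uniform(-1.0, 1.0) for _ in range(dim)]
          for _ in range(50)]

# Assumption: base_seed is chosen once (e.g. on the frontend) and
# shipped to every task, so each task can rebuild the same functions.
base_seed = 42
task_i = make_cosine_lsh(dim, num_bits, derive_seed(base_seed, 0))
task_j = make_cosine_lsh(dim, num_bits, derive_seed(base_seed, 0))
shared_seed_agrees = all(task_i(p) == task_j(p) for p in points)

# With no seed, each task draws its own hyperplanes, so the same
# lsh_id would almost surely name two different functions.
rand_i = make_cosine_lsh(dim, num_bits)
rand_j = make_cosine_lsh(dim, num_bits)
random_seed_mismatches = sum(rand_i(p) != rand_j(p) for p in points)

print("shared seed agrees on all points:", shared_seed_agrees)
print("mismatches with unseeded RNGs:", random_seed_mismatches)
```

Deriving one seed per lsh_id from a single shared base seed keeps the k hash functions distinct from each other while keeping each one identical across tasks, which is the property the frontend-generated seed proposal is after.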