[
https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994094#comment-13994094
]
Casey Stella commented on DATAFU-37:
------------------------------------
I added a comment on Review Board about the seed being required, and it seems
reasonable to move the discussion here, so I'll copy that comment over. The
question at hand is whether we should require a seed to be specified or just
have each LSH use a random seed:
My concern is that with a random seed, different tasks would effectively
create different LSH functions under the same lsh_id, because that RNG is used
to construct the definition of the LSH function. The whole notion of the
lsh_id is that you create k different hashes and hash all points with each;
then, for each hash (unique lsh_id), you search among the values that share
the query point's hash. If we make it a random seed, then for lsh_id x, points
hashed in task i and task j will be hashed with different functions.
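To make the concern concrete, here is a minimal sketch in plain Java. It is not
the code from the patch: the class name, the method names, and the choice of a
random-hyperplane (cosine-style) hash are all assumptions made for
illustration. The point is only that the seed fully determines the hash
function, so two tasks that seed their RNGs differently end up with different
functions for what is supposed to be the same lsh_id.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not the DataFu implementation.
class SeededLshSketch {

    // Build the "definition" of one LSH function: numBits random hyperplanes
    // in dim dimensions, all drawn from an RNG created from the given seed.
    static double[][] buildHyperplanes(long seed, int numBits, int dim) {
        Random rng = new Random(seed);
        double[][] planes = new double[numBits][dim];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dim; d++) {
                planes[b][d] = rng.nextGaussian();
            }
        }
        return planes;
    }

    // Hash a point: bit b is set when the point lies on the non-negative
    // side of hyperplane b.
    static int hash(double[][] planes, double[] point) {
        int bits = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < point.length; d++) {
                dot += planes[b][d] * point[d];
            }
            if (dot >= 0) {
                bits |= 1 << b;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        double[] point = {1.0, -2.0, 0.5};

        // Two tasks sharing a seed build identical hyperplanes.
        double[][] taskI = buildHyperplanes(42L, 8, 3);
        double[][] taskJ = buildHyperplanes(42L, 8, 3);
        System.out.println("same seed, same planes: " + Arrays.deepEquals(taskI, taskJ));
        System.out.println("same seed, same hash:   " + (hash(taskI, point) == hash(taskJ, point)));

        // A task with its own seed builds a different function.
        double[][] taskK = buildHyperplanes(43L, 8, 3);
        System.out.println("different seed, same planes: " + Arrays.deepEquals(taskI, taskK));
    }
}
```

With a shared seed every task rebuilds the same hyperplanes, so buckets are
comparable across tasks. With per-task random seeds the hyperplanes differ, and
matching hash values for the same lsh_id no longer mean anything.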
Thoughts?
EDIT: What if I generated the seed from an RNG on the frontend in
outputSchema and passed it in via the UDFContext? That way we have a
consistent seed, which means consistent LSH functions across the cluster.
Would that be suitable?
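A rough sketch of that idea, again in plain Java with made-up names. A
java.util.Properties object stands in for whatever the UDFContext would hand
back (in Pig I would expect that to be the Properties returned by
UDFContext.getUDFContext().getUDFProperties(...), but treat that wiring as an
assumption, since nothing here touches Pig). The key name "lsh.seed" and both
method names are hypothetical.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical illustration only; Properties stands in for the UDF's
// shared properties, and no Pig classes are used.
class FrontendSeedSketch {

    // Hypothetical property key under which the seed is stored.
    static final String SEED_KEY = "lsh.seed";

    // Front end: draw the seed once and store it. If a seed is already
    // present (e.g. the method is called more than once), keep it.
    static void publishSeed(Properties props, Random frontendRng) {
        if (props.getProperty(SEED_KEY) == null) {
            props.setProperty(SEED_KEY, Long.toString(frontendRng.nextLong()));
        }
    }

    // Back end: every task reads the stored seed and builds its RNG from it,
    // so all tasks construct the same LSH functions.
    static Random rngForTask(Properties props) {
        String seed = props.getProperty(SEED_KEY);
        if (seed == null) {
            throw new IllegalStateException("no seed was published by the front end");
        }
        return new Random(Long.parseLong(seed));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        publishSeed(props, new Random());

        // Two tasks reading the same properties draw identical values.
        long fromTaskI = rngForTask(props).nextLong();
        long fromTaskJ = rngForTask(props).nextLong();
        System.out.println("tasks agree: " + (fromTaskI == fromTaskJ));
    }
}
```

The user never has to supply a seed, but the randomness is drawn exactly once
per script rather than once per task.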
> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
> Key: DATAFU-37
> URL: https://issues.apache.org/jira/browse/DATAFU-37
> Project: DataFu
> Issue Type: New Feature
> Reporter: Casey Stella
> Assignee: Casey Stella
> Attachments: DATAFU-37-1.patch, DATAFU-37-2.patch, DATAFU-37.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive
> Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing] in support
> of finding k-nearest neighbors. Initially, hashes associated with the L1 and
> L2 distances and with cosine similarity should be supported.