[ https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985142#comment-13985142 ]
Matthew Hayes commented on DATAFU-37:
-------------------------------------

Something else I was wondering about when going through the code and reading the paper is how to determine the parameters.

For CosineDistanceHash the important parameter is:

* sRepeat: Number of internal repetitions

For L1PStableHash and L2PStableHash the important parameters are:

* sW: A double representing the quantization parameter (also known as the projection width)
* sRepeat: Number of internal repetitions (generally this should be 1 as the p-stable hashes have a larger range than one bit)

You mention that the parameters should be determined empirically. I also came across a presentation you did, file:///Users/mhayes/Downloads/presentation.pdf, where you mention a tool that can assist in choosing the parameters. Do you think we could estimate parameters using a data sample and these UDFs or do we need additional UDFs to do that?

> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
>                 Key: DATAFU-37
>                 URL: https://issues.apache.org/jira/browse/DATAFU-37
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>         Attachments: DATAFU-37.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive
> Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing] in support
> of finding k-near neighbors. Initially, hashes associated with L1, L2 and
> Cosine similarity should be supported.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
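
To make the roles of sW and sRepeat concrete, here is a minimal Python sketch of the textbook hash families the comment refers to: the L2 p-stable hash floor((a . v + b) / w) and the one-bit random-hyperplane hash for cosine distance. It is not the code in DATAFU-37.patch. The function names, the seed handling, and the `collision_rate` helper are assumptions made up for illustration. The helper shows one way a projection width might be estimated from a data sample, by measuring how often known-near and known-far pairs land in the same bucket for a candidate w.

```python
import math
import random


def l2_pstable_hash(vec, w, repeat=1, seed=0):
    """Return `repeat` bucket ids, each floor((a . v + b) / w).

    `w` plays the role of sW (projection width) and `repeat` the role
    of sRepeat. Names and signature are hypothetical, not DataFu's API.
    """
    rng = random.Random(seed)
    buckets = []
    for _ in range(repeat):
        a = [rng.gauss(0.0, 1.0) for _ in vec]  # Gaussian entries are 2-stable
        b = rng.uniform(0.0, w)                 # random offset within one bucket
        proj = sum(ai * vi for ai, vi in zip(a, vec))
        buckets.append(math.floor((proj + b) / w))
    return tuple(buckets)


def cosine_hash(vec, repeat, seed=0):
    """Return `repeat` sign bits from random hyperplanes (one bit each)."""
    rng = random.Random(seed)
    bits = []
    for _ in range(repeat):
        a = [rng.gauss(0.0, 1.0) for _ in vec]
        proj = sum(ai * vi for ai, vi in zip(a, vec))
        bits.append(1 if proj >= 0 else 0)
    return tuple(bits)


def collision_rate(pairs, w, trials=200):
    """Fraction of (pair, seed) combinations whose L2 hashes collide.

    Hypothetical helper: run it on sampled near pairs and far pairs and
    pick a w where the two rates are well separated.
    """
    hits = 0
    for seed in range(trials):
        for u, v in pairs:
            if l2_pstable_hash(u, w, seed=seed) == l2_pstable_hash(v, w, seed=seed):
                hits += 1
    return hits / (trials * len(pairs))
```

A wider w makes every pair more likely to collide, while a larger repeat makes a full-signature match stricter, which is why a single bucket id from a p-stable hash can carry as much information as several cosine bits.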