[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893117#comment-15893117
 ] 

Mingjie Tang commented on SPARK-19771:
--------------------------------------

(1) Because you need to explode each tuple. In the example mentioned above, 
for one input tuple you have to build 3 rows, and each hash value contains a 
vector whose length is the number of hash functions. Thus, for one tuple, your 
memory overhead is NumHashFunctions*NumHashTables=15, and if the number of 
input tuples is N, the overhead is NumHashFunctions*NumHashTables*N. 
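
A minimal sketch of that count, in plain Python rather than Spark. The names explode_tuple and hash_value_overhead are hypothetical, the hash values are placeholders, and NumHashFunctions=5 is an assumption inferred from 3 rows and the total of 15:

```python
def explode_tuple(tuple_id, num_hash_tables, num_hash_functions):
    """Return one row per hash table; each row carries a vector of hash values."""
    rows = []
    for table in range(num_hash_tables):
        # Placeholder hash values (zeros); only the shape matters for the count.
        hash_vector = [0] * num_hash_functions
        rows.append((tuple_id, table, hash_vector))
    return rows


def hash_value_overhead(num_tuples, num_hash_tables, num_hash_functions):
    """Number of hash values stored after exploding every input tuple."""
    return num_hash_functions * num_hash_tables * num_tuples


rows = explode_tuple("t0", num_hash_tables=3, num_hash_functions=5)
stored = sum(len(vec) for _, _, vec in rows)
print(len(rows), stored)  # 3 15
```

With N = 1000 input tuples, hash_value_overhead(1000, 3, 5) gives 15000 stored hash values.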

(2) Yes, the hash value can be any value, depending on your input bucket 
width W. Actually, it should be very big so that there are fewer collisions.
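
To illustrate how the range of hash values depends on W, here is a rough sketch that assumes the usual random-projection form floor(a.x / W). The function bucket_hash, the vector a, and the sample points are all hypothetical and are not taken from the Spark code:

```python
import math


def bucket_hash(x, a, w):
    """floor(<a, x> / w): index of the width-w bucket the projection falls in."""
    projection = sum(ai * xi for ai, xi in zip(a, x))
    return math.floor(projection / w)


a = [0.6, 0.8]  # hypothetical unit-length projection vector
points = [[0.0, 0.0], [1.0, 1.0], [10.0, 5.5], [-20.0, 3.0]]

for w in (100.0, 1.0):
    # A narrower bucket spreads the same points over more distinct hash values.
    print(w, [bucket_hash(p, a, w) for p in points])
```

In this sketch the hash value is an unbounded integer, and a smaller W maps the same points to more distinct values, so fewer of them collide.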

(3) I am not sure that hashCode would work, because we need to use this 
function for multi-probe searching.
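
A simplified sketch of why multi-probe needs the raw hash vector: neighboring buckets are found by perturbing individual hash values. The helper neighbor_buckets is hypothetical and only does a one-step +/-1 perturbation, which is a simplification of a full multi-probe sequence:

```python
def neighbor_buckets(hash_vector):
    """Bucket keys that differ from hash_vector by +/-1 in exactly one position."""
    probes = []
    for i, value in enumerate(hash_vector):
        for delta in (-1, 1):
            probe = list(hash_vector)
            probe[i] = value + delta
            probes.append(tuple(probe))
    return probes


home = (3, -1, 7)  # hypothetical hash vector of a query point
probes = neighbor_buckets(home)
print(len(probes))  # 6
```

If only a single hashCode of the whole vector were stored, these per-coordinate neighbors could not be derived from it.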

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we move on to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
