[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893117#comment-15893117 ]
Mingjie Tang commented on SPARK-19771:
--------------------------------------

(1) Because you need to explode each tuple. In the example mentioned above, one input tuple produces 3 rows, and each row's hash value is a vector whose length is the number of hash functions. So for one tuple the memory overhead is NumHashFunctions * NumHashTables = 15 hash values, and for N input tuples the overhead is NumHashFunctions * NumHashTables * N.

(2) Yes, the hash value can be anything, depending on your input bucket width W. Actually, it should be very large to reduce collisions.

(3) I am not sure hashCode can work, because we need to use this function for multi-probe searching.

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
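The overhead estimate in point (1) of the comment can be illustrated with a small sketch. This is not Spark's implementation. It assumes 3 hash tables (matching the 3 rows per tuple) and 5 hash functions per table (inferred from 15 / 3), and it assumes a bucketed random projection hash of the form floor(a . x / W). The names make_projections, explode, and stored_hash_values and the value chosen for W are hypothetical.

```python
import math
import random

NUM_HASH_TABLES = 3     # assumed: matches the 3 rows per tuple in the comment
NUM_HASH_FUNCTIONS = 5  # assumed: inferred from 15 / 3
BUCKET_WIDTH = 4.0      # hypothetical value for the bucket width W


def make_projections(dim, num_tables, num_functions, seed=42):
    """Draw one random Gaussian projection vector per (table, function) pair."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_functions)]
        for _ in range(num_tables)
    ]


def explode(point, projections, bucket_width):
    """Turn one input tuple into one row per hash table.

    Each row carries a vector with one hash value per hash function,
    computed here as floor(a . x / W) (an assumed hash form).
    """
    rows = []
    for table_id, table in enumerate(projections):
        hash_vector = [
            math.floor(sum(a * x for a, x in zip(proj, point)) / bucket_width)
            for proj in table
        ]
        rows.append((table_id, hash_vector))
    return rows


def stored_hash_values(num_tuples, num_tables, num_functions):
    """Total hash values held after exploding every input tuple."""
    return num_tuples * num_tables * num_functions


projections = make_projections(2, NUM_HASH_TABLES, NUM_HASH_FUNCTIONS)
rows = explode([1.0, 2.0], projections, BUCKET_WIDTH)
per_tuple = sum(len(hash_vector) for _, hash_vector in rows)
print(len(rows), per_tuple)  # prints "3 15"
```

For N input tuples, stored_hash_values(N, 3, 5) gives 15 * N, which is the NumHashFunctions * NumHashTables * N growth described in the comment.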