Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965

@Yunni Yes, we can use AND-OR amplification to increase the collision probability by using more numHashTables and numHashFunctions. As a further extension, if users have a hash function with a lower collision probability, OR-AND amplification could be used instead.

(1) We do not need to change Array[Vector], numHashTables, or numHashFunctions; we only need to change the function that computes the hash distance (i.e., hashDistance), as well as the sameBucket function in approxNearestNeighbors (a sketch is below).

(3) For the similarity join, I have one question: if you do a join based on the hashed values of the input tuples, the join key would be an Array(Vector). Am I right? If so, does this still give OR-amplification? Please correct me if I am wrong. (See the join sketch below.)

(4) For the index part, I think it would work. It is quite similar to the routing-table idea in GraphX: we can create another DataFrame with the same partitioner as the input DataFrame, so the newly created DataFrame would hold the index for the input table without disturbing the original DataFrame.

(5) The other major concern is memory overhead. Can we reduce the memory usage of the output hash values, i.e., Array(Vector)? Users have reported that the current representation consumes a lot of memory. One option is to use bits to represent the hashed values for MinHash (a bit-packing sketch is below); another is to use sparse vectors. What do you think?
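To make point (1) concrete, here is a minimal Scala sketch of how hashDistance and a sameBucket check might compose AND (within one hash table) with OR (across tables). It assumes the model's output stays an Array[Vector], with one Vector of numHashFunctions values per table; the object and method names are only illustrative, not Spark's actual internals.

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch only: each element of `x` / `y` is assumed to hold one hash table's
// numHashFunctions values, and the outer sequence has numHashTables entries.
object AndOrAmplification {

  // AND within a table: two points share a bucket in a table only if
  // every hash function in that table agrees.
  private def tableMismatch(a: Vector, b: Vector): Double = {
    val mismatches = a.toArray.zip(b.toArray).count { case (u, v) => u != v }
    mismatches.toDouble / a.size
  }

  // OR across tables: take the best (smallest) per-table mismatch, so the
  // distance is 0.0 as soon as any single table matches on all functions.
  def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double =
    x.zip(y).map { case (a, b) => tableMismatch(a, b) }.min

  // sameBucket under OR-amplification: at least one table fully agrees.
  def sameBucket(x: Seq[Vector], y: Seq[Vector]): Boolean =
    hashDistance(x, y) == 0.0
}
```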
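For point (3), one way to get OR-amplification in the similarity join is to avoid joining on the whole Array(Vector) and instead explode it, joining on (table index, per-table hash vector) so that a match in any single table makes the pair a candidate. A rough sketch, assuming each input DataFrame has an `id` column and a `hashes` column produced by the LSH model (column and function names here are my own, not the PR's):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, posexplode}

// Sketch only: explode the per-table hashes and join on (table, entry),
// which is OR-amplification: ANY single table agreeing yields a candidate.
def orAmplifiedCandidatePairs(datasetA: DataFrame, datasetB: DataFrame): DataFrame = {
  val explodedA = datasetA
    .select(col("id").as("idA"), posexplode(col("hashes")).as(Seq("table", "entry")))
  val explodedB = datasetB
    .select(col("id").as("idB"), posexplode(col("hashes")).as(Seq("table", "entry")))

  explodedA.join(explodedB, Seq("table", "entry"))
    .select("idA", "idB")
    .distinct()   // the same pair may match in more than one table
}
```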
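For point (5), one possible direction for the MinHash case is something in the spirit of b-bit minwise hashing: keep only the lowest b bits of each hash value and pack them into Longs instead of storing doubles in an Array[Vector]. This is only a sketch of the idea with made-up names, not an existing Spark API:

```scala
// Sketch only: pack the low b bits of each MinHash value into a Long array.
object PackedMinHash {
  def pack(signature: Array[Long], b: Int): Array[Long] = {
    require(b > 0 && b <= 64)
    val mask = if (b == 64) -1L else (1L << b) - 1
    val packed = new Array[Long]((signature.length * b + 63) / 64)
    signature.zipWithIndex.foreach { case (h, i) =>
      val v = h & mask              // keep only the low b bits
      val bitPos = i * b
      val word = bitPos / 64
      val offset = bitPos % 64
      packed(word) |= v << offset
      // if the b bits straddle a word boundary, spill the rest into the next word
      if (offset + b > 64 && word + 1 < packed.length) {
        packed(word + 1) |= v >>> (64 - offset)
      }
    }
    packed
  }
}
```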