Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15148

It sounds like discussions are converging, but I want to confirm a few things and make a few additions.

### Amplification

Is this agreed?
* Approx nearest neighbors and approx similarity join are doing OR-amplification when comparing hash values, as described in the [Wikipedia article](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions). This computes an amplified hash function *implicitly*.
* transform() is not doing amplification. It outputs the values of a collection of hash functions, rather than aggregating them to do amplification.
* This is my main question: Is amplification ever done explicitly, and when would you ever need that?

Adding combined AND and OR amplification in the future sounds good to me. My main question right now is whether we need to adjust the API before the 2.1 release. I don't see a need to, but please comment if you see an issue with the current API.
* One possibility: We could rename outputDim to something specific to OR-amplification.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH whose output does not come from a discrete set.

### Random Projection

I agree this should be renamed to something like "PStableHashing." My apologies for not doing enough background research to disambiguate.

### MinHash

I think this is implemented correctly, according to the reference given in the linked Wikipedia article.
* [This reference](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L36) to perfect hash functions may be misleading. I'd prefer to remove it.

### hashDistance

Rethinking this, I am unsure about what function we should use. Currently, hashDistance is only used by approxNearestNeighbors.
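To make the implicit-vs-explicit distinction concrete, here is a minimal pure-Scala sketch, independent of the Spark API; the object name, the hash family, and its parameters are made up for illustration. transform() just emits the raw signature, while OR-amplification happens only at comparison time: two inputs "collide" if *any* single hash value matches.

```scala
object OrAmplificationSketch {
  // Hypothetical family of k MinHash-style functions h_i(S) = min over e in S
  // of (a_i * e + b_i) mod p. The parameters below are made up for illustration.
  val p = 2038074743L
  val params: Seq[(Long, Long)] = Seq((3L, 7L), (5L, 11L), (13L, 17L))

  // transform(): output the value of each hash function for a set of element ids.
  // No amplification happens here -- this is just the signature.
  def transform(set: Set[Long]): Seq[Long] =
    params.map { case (a, b) => set.map(e => (a * e + b) % p).min }

  // OR-amplification, done implicitly when comparing signatures:
  // declare a collision if ANY single hash value matches.
  def orCollide(x: Set[Long], y: Set[Long]): Boolean =
    transform(x).zip(transform(y)).exists { case (hx, hy) => hx == hy }
}
```

An explicit amplification API would instead materialize the combined AND/OR hash as its own function; the sketch above only ever evaluates it at comparison time.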
Since approxNearestNeighbors sorts by hashDistance, using a soft measure might be better than what we currently have:
* MinHash
  * Currently: Uses OR-amplification for single probing, and something odd for multiple probing.
  * Best option for approxNearestNeighbors: [this Wikipedia section](https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions), which is equivalent to OR-amplification when using single probing. I.e., replace [this line of code](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L79) with:
    ```scala
    x.toDense.values.zip(y.toDense.values).count(pair => pair._1 == pair._2).toDouble / x.size
    ```
* RandomProjection
  * Currently: Uses OR-amplification for single probing, and something reasonable for multiple probing.

@Yunni What is the best resource you have for single vs. multiple probing? I'm wondering now if those are uncommon terms and should be renamed.
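For reference, the replacement above computes the "many hash functions" estimator: the fraction of signature positions at which two inputs agree, which estimates Jaccard similarity. A self-contained sketch on plain arrays (the object and method names are hypothetical; Array[Double] stands in for the dense vector values):

```scala
object MinHashSignatureMatch {
  // Fraction of positions at which two equal-length MinHash signatures agree.
  // With k independent hash functions, this is the standard unbiased estimate
  // of the Jaccard similarity of the underlying sets.
  def matchFraction(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "signatures must have the same length")
    x.zip(y).count { case (a, b) => a == b }.toDouble / x.length
  }
}
```

Note this is a similarity in [0, 1], not a distance, so a sort order for approxNearestNeighbors would need to account for that (e.g., sort descending, or use one minus this value).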