Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    It sounds like discussions are converging, but I want to confirm a few 
things + make a few additions.
    
    ### Amplification
    
    Is this agreed?
    * Approx neighbors and similarity are doing OR-amplification when comparing 
hash values, as described in the [Wikipedia 
article](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions).
  This is computing an amplified hash function *implicitly*.
    * transform() is not doing amplification.  It outputs the value of a 
collection of hash functions, rather than aggregating them to do amplification.
      * This is my main question: Is amplification ever done explicitly, and 
when would you ever need that?
    
    Adding combined AND and OR amplification in the future sounds good to me.  
My main question right now is whether we need to adjust the API before the 2.1 
release.  I don't see a need to, but please comment if you see an issue with 
the current API.
    * One possibility: We could rename outputDim to something specific to 
OR-amplification.
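    To make the distinction concrete, here is a minimal sketch (illustrative names, not the actual Spark API) of what explicit OR- vs AND-amplification over precomputed hash values would look like:

    ```scala
    // Candidate pair if ANY of the hash values collide (OR-amplification).
    def orMatch(h1: Array[Double], h2: Array[Double]): Boolean =
      h1.zip(h2).exists { case (a, b) => a == b }

    // Candidate pair only if ALL hash values collide (AND-amplification).
    def andMatch(h1: Array[Double], h2: Array[Double]): Boolean =
      h1.zip(h2).forall { case (a, b) => a == b }
    ```

    OR-amplification raises recall at the cost of precision; AND-amplification does the opposite, which is why combining the two later is attractive.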
    
    Terminology: For LSH, "dimensionality" = "number of hash functions" and is 
relevant only for amplification.  Do you agree?  I have yet to see a hash 
function used for LSH whose output values do not come from a discrete set.
    
    ### Random Projection
    
    I agree this should be renamed to something like "PStableHashing."  My 
apologies for not doing enough background research to disambiguate.
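    For reference, a minimal sketch of one p-stable hash function (Gaussian, p = 2), h(x) = floor((a · x + b) / w) with bucket width w, as described in the Wikipedia article; this is a sketch of the scheme, not the PR's exact implementation:

    ```scala
    import scala.util.Random

    val rng = new Random(42L)
    val dim = 3                                   // input dimensionality
    val w   = 4.0                                 // bucket width
    val a   = Array.fill(dim)(rng.nextGaussian()) // 2-stable (Gaussian) projection vector
    val b   = rng.nextDouble() * w                // uniform offset in [0, w)

    // h(x) = floor((a . x + b) / w)
    def pStableHash(x: Array[Double]): Int =
      math.floor((a.zip(x).map { case (ai, xi) => ai * xi }.sum + b) / w).toInt
    ```

    Nearby points (relative to w) fall into the same bucket with high probability, which is what makes the family locality-sensitive for Euclidean distance.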
    
    ### MinHash
    
    I think this is implemented correctly, according to the reference given in 
the linked Wikipedia article.
    * [This 
reference](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L36)
 to perfect hash functions may be misleading.  I'd prefer to remove it.
    
    ### hashDistance
    
    Rethinking this, I am unsure about what function we should use.  Currently, 
hashDistance is only used by approxNearestNeighbors.  Since 
approxNearestNeighbors sorts by hashDistance, using a soft measure might be 
better than what we currently have:
    * MinHash
      * Currently: Uses OR-amplification for single probing, and something odd 
for multiple probing
      * Best option for approxNearestNeighbors: [this Wikipedia 
section](https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions),
 which is equivalent to OR-amplification when using single probing.  I.e., 
replace [this line of 
code](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L79)
 with: ```x.toDense.values.zip(y.toDense.values).count(pair => pair._1 == 
pair._2).toDouble / x.size```
    * RandomProjection
      * Currently: Uses OR-amplification for single probing, and something 
reasonable for multiple probing
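    To make the MinHash proposal above concrete, here is a sketch of the soft measure with plain arrays standing in for Spark Vectors: the fraction of positions where two signatures agree, which estimates the Jaccard similarity of the underlying sets:

    ```scala
    // Fraction of positions where the two MinHash signatures agree;
    // an estimate of the Jaccard similarity of the underlying sets.
    def hashAgreement(x: Array[Double], y: Array[Double]): Double =
      x.zip(y).count { case (a, b) => a == b }.toDouble / x.length
    ```

    Since approxNearestNeighbors sorts ascending by hashDistance, the quantity actually sorted on would presumably be `1 - hashAgreement(x, y)`.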
    
    @Yunni What is the best resource you have for single vs multiple probing?  
I'm wondering now if they are uncommon terms and should be renamed.

