GitHub user karlhigley commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    @jkbradley: "Multi-probe" seems like a standard term, and I think this is 
the [original paper](http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf) 
that coined it.
    
    > Terminology: For LSH, "dimensionality" = "number of hash functions" and 
is relevant only for amplification. Do you agree? I have yet to see a hash 
function used for LSH which does not have a discrete set.
    
    I confess that I'm a little confused what you mean by the above. There are 
several relevant dimensionalities: the dimensionality of the input points 
(`x`), the dimensionality of the computed hashes (i.e. the results of applying 
`g(x)`), and the number of hash tables computed (i.e. how many `g(x)` functions 
are applied), which is the dimensionality of OR-amplification (in a sense).
    
    After wrestling with inconsistent terminology for a while, what I settled 
on for spark-neighbors was to refer to `g(x)` as a hash function, the outputs 
of `g(x)` as hashes, the sub-elements of `g(x)` -- `h1(x)` etc. -- as whatever 
made sense for the particular method (e.g. `permutations` for MinHash), and the 
output of each of the L `g(x)` functions as a hash table. While that 
terminology isn't necessarily standard, it helped me identify the common 
concepts across LSH methods clearly enough to build some abstractions around 
them.
    
    Using those terms, the dimensionality of the `g(x)` hash functions and the 
hashes they produce is equivalent to the number of `h(x)` sub-elements they 
contain. I thought of applying OR-amplification as producing multiple hash 
tables by using multiple `g(x)` functions, with a collision in any one hash 
table producing a pair of candidate neighbors.
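
    To make the terminology concrete, here is a rough sketch in plain Python 
(not the spark-neighbors code and not the API proposed in this PR). It assumes 
sign-random-projection sub-elements for `h(x)`; the names `make_g` and 
`candidate_pairs` are made up for illustration. Each `g(x)` concatenates `k` 
sub-elements into one hash, and each of the `L` `g(x)` functions fills one hash 
table.

```python
import random


def make_g(dim, k, rng):
    """Build one hash function g(x) out of k sub-elements h_i(x).

    Assumption: each h_i(x) is the sign of a random projection of x.
    """
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def g(x):
        # The hash is a k-dimensional tuple of bits. Two points share a
        # bucket only if every h_i agrees (AND across sub-elements).
        return tuple(
            int(sum(p_j * x_j for p_j, x_j in zip(plane, x)) >= 0.0)
            for plane in planes
        )

    return g


def candidate_pairs(points, gs):
    """Return index pairs that collide in at least one hash table.

    Each g in gs produces one hash table; a collision in any table makes
    the pair a candidate (OR across tables).
    """
    pairs = set()
    for g in gs:
        table = {}
        for idx, x in enumerate(points):
            table.setdefault(g(x), []).append(idx)
        for bucket in table.values():
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    pairs.add((bucket[a], bucket[b]))
    return pairs


rng = random.Random(0)
dim, k, L = 3, 4, 5  # input dimensionality, hash dimensionality, number of tables
gs = [make_g(dim, k, rng) for _ in range(L)]

# Points 0 and 1 point in the same direction; point 2 points the opposite way.
points = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
pairs = candidate_pairs(points, gs)
```

    In this sketch `dim`, `k`, and `L` are the three dimensionalities 
mentioned above: that of the input points, that of each hash, and the number 
of hash tables.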
    
    Does that make any more (or less) sense? 

