Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    I tend to agree that the terminology used here is a little confusing, and 
doesn't seem to match the terminology in general use (I use that term 
loosely, however).
    
    #### Terminology
    
    In my dealings with LSH, I too have tended to come across the version that 
@sethah mentions (and that @karlhigley's package, and others such as 
https://github.com/marufaytekin/lsh-spark, implement): each input 
vector is hashed into `L` "tables" of hash signatures of "length" or 
"dimension" `d`. Each hash signature is created by concatenating the results of 
applying `d` "hash functions".
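    
    To make the `L` / `d` terminology concrete, here is a minimal sketch of that 
scheme using sign-random-projection hashing for cosine similarity. All names 
(`make_hash_tables`, `signature`, `hash_all`) are illustrative, not Spark's API:
    
    ```python
    import random

    def make_hash_tables(num_tables, sig_length, dim, seed=42):
        """L tables, each holding d random-hyperplane hash functions
        (generic sign-random-projection LSH, not Spark's exact code)."""
        rng = random.Random(seed)
        return [[[rng.gauss(0, 1) for _ in range(dim)]
                 for _ in range(sig_length)]
                for _ in range(num_tables)]

    def signature(vec, table):
        """Concatenate the d hash-function outputs into one signature."""
        return tuple(1 if sum(w * x for w, x in zip(hp, vec)) >= 0 else 0
                     for hp in table)

    def hash_all(vec, tables):
        """One signature per table: an L x d set of hash values."""
        return [signature(vec, t) for t in tables]

    # L = 5 tables, each signature of length d = 3
    tables = make_hash_tables(num_tables=5, sig_length=3, dim=4)
    sigs = hash_all([0.1, -0.4, 0.7, 0.2], tables)
    ```
    
    The point of the sketch is just the shape of the output: `L` signatures, 
each a concatenation of `d` hash-function results.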
    
    I agree that what's effectively implemented here is `L = outputDim` and 
`d = 1`. What I find a bit troubling is that this is done "implicitly", as part 
of the `hashDistance` function. Without knowing that is what is happening, it is 
not clear to a new user - coming from other common LSH implementations - that 
`outputDim` is not the "number of hash functions" or "length of the hash 
signatures" but actually the "number of hash tables".
    
    #### Transform semantics
    
    In terms of `transform` - I disagree somewhat that the main use case is 
"dimensionality reduction". Perhaps there are common examples of using the hash 
signatures as a lower-dim representation as a feature in some model (e.g. in a 
similar way to say a PCA transform), but I haven't seen that. In my view, the 
real use case is the approximate nearest neighbour search.
    
    I'll give a concrete example for the `transform` output. Let's say I want 
to export recommendation model factor vectors (from ALS), or Word2Vec vectors, 
etc., to a real-time scoring system. I have many items, so I'd like to use LSH 
to make my scoring feasible. I do this by performing, in effect, a real-time 
version of OR-amplification. I store the hash tables (`L` tables of `d` hash 
signatures) alongside my vectors. When computing "similar items" for a given 
item, I retrieve the hash sigs of the query item, and use these to filter down 
the candidate item set for my scoring. This is in fact something I'm working on 
in a demo project currently. So if we will support the OR/AND combo, then it 
will be very important to output the full `L x d` set of hash sigs in 
`transform`.
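    
    The retrieval pattern I have in mind can be sketched as follows: one 
inverted index per hash table, with OR-amplification at query time (an item is 
a candidate if it collides with the query in *any* table). This is a toy 
illustration with made-up names and a generic sign-projection scheme, not the 
Spark implementation:
    
    ```python
    from collections import defaultdict
    import random

    rng = random.Random(0)
    DIM, L, D = 4, 3, 2
    # L tables of D random hyperplanes each (illustrative scheme)
    tables = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(D)]
              for _ in range(L)]

    def hash_all(vec):
        """One length-D signature per table."""
        return [tuple(1 if sum(w * x for w, x in zip(hp, vec)) >= 0 else 0
                      for hp in table)
                for table in tables]

    # Offline: store each item's signatures in one index per table
    items = {"a": [1, 0, 0, 0], "b": [0.9, 0.1, 0, 0], "c": [-1, 0, 0, 0]}
    index = [defaultdict(set) for _ in range(L)]
    for item_id, vec in items.items():
        for t, sig in enumerate(hash_all(vec)):
            index[t][sig].add(item_id)

    def candidates(query_vec):
        """OR-amplification: keep items that share a signature in ANY table,
        then score only this reduced candidate set."""
        out = set()
        for t, sig in enumerate(hash_all(query_vec)):
            out |= index[t].get(sig, set())
        return out

    cands = candidates([1, 0, 0, 0])
    ```
    
    This is why `transform` exposing the full `L x d` set matters: the scoring 
system needs every per-table signature, not a single aggregate.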
    
    #### Proposal
    
    My recommendation is: 
    
    1. Future-proof the API by returning `Array[Vector]` in `transform` (as 
mentioned above by others).
    2. Update the docs / user guide to make it really clear what the 
implementation is doing.
    3. Make it clear that the implied `d` value here is `1`; we can mention 
that AND-amplification will be implemented later and perhaps even link to a 
JIRA.
    4. Rename `outputDim` to something like `numHashTables`.
    5. When we add AND-amplification, add a parameter such as 
`hashSignatureLength` or `numHashFunctions`.
    6. Make as much private as possible to avoid being stuck with any 
implementation detail in future releases (e.g. I also don't see why 
`randUnitVectors` or `randCoefficients` need to be public).
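    
    To illustrate points 4 and 5, the parameter surface I'm picturing would 
look something like this (all names hypothetical; defaults chosen so existing 
behaviour is unchanged):
    
    ```python
    class LSHParams:
        """Hypothetical future parameter surface, not an actual Spark class."""

        def __init__(self, num_hash_tables=1, num_hash_functions=1):
            # numHashTables replaces outputDim: the L of OR-amplification.
            self.num_hash_tables = num_hash_tables
            # numHashFunctions (the signature length d) would arrive with
            # AND-amplification; defaulting to 1 preserves current behaviour.
            self.num_hash_functions = num_hash_functions

    p = LSHParams()
    ```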
    
    One issue I have is that currently we would output a `1 x L` set of hash 
values, when it actually should be `L x 1`, i.e. a set of `L` signatures, each 
of length `1`. I guess we can leave it as is, but we should document what the 
output actually is.
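    
    The two output shapes side by side, with made-up hash values, just to pin 
down the `1 x L` versus `L x 1` distinction:
    
    ```python
    L = 4
    # Current: a single vector of L hash values (1 x L)
    current_output = [3.0, 7.0, 1.0, 9.0]
    # Proposed Array[Vector]: L signatures, each of length d = 1 (L x 1)
    proposed_output = [[3.0], [7.0], [1.0], [9.0]]
    ```
    
    The values are identical either way; only the nesting changes, which is 
what makes the `Array[Vector]` return type future-proof for `d > 1`.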
    
    I believe we should support OR/AND in future. If so, then to me many things 
need to change - `hashFunction`, `hashDistance`, etc. will need to be 
refactored. Most of the implementation is private/protected, so I think it will 
be ok. Let's just ensure we're not left with an API that we can't change in 
future. Setting the same `L` with `d = 1` must then yield the same result as 
the current implementation to avoid a behavior change (I guess this will be ok, 
since the current default for `L` is `1`, and we can make the default for `d`, 
when added, also `1`).
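    
    One way to see why `d = 1` preserves behaviour: a combined AND/OR distance 
could take the minimum per-table mismatch (OR across tables) over full 
signatures (AND within a table). This is only a sketch of that idea, not a 
proposal for the exact `hashDistance` signature:
    
    ```python
    def hash_distance(sigs_a, sigs_b):
        """AND within a table (count mismatched positions in a signature),
        OR across tables (take the minimum over the L tables).
        Illustrative only, not Spark's hashDistance."""
        return min(sum(1 for a, b in zip(sa, sb) if a != b)
                   for sa, sb in zip(sigs_a, sigs_b))

    # d = 1: reduces to a min over L single hash values, as today
    d1 = hash_distance([[1], [5], [9]], [[2], [5], [7]])
    # d = 2: distance 0 requires a full signature match in some table
    d2 = hash_distance([[1, 2], [5, 6]], [[1, 3], [5, 7]])
    ```
    
    With `d = 1` each "signature" is a single value, so the combined form 
collapses to the current min-over-hash-values behaviour.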
    
    Finally, my understanding was that results from some performance testing 
would be posted. I don't believe we've seen those yet.

