Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15148

I tend to agree that the terminology used here is a little confusing and doesn't seem to match up with the "general" terminology (I use that term loosely, however).

#### Terminology

In my dealings with LSH, I too have tended to come across the version that @sethah mentions (and which @karlhigley's package, and others such as https://github.com/marufaytekin/lsh-spark, implement). That is, each input vector is hashed into `L` "tables" of hash signatures of "length" or "dimension" `d`. Each hash signature is created by concatenating the results of applying `d` hash functions.

I agree that what's effectively implemented here is `L = outputDim` and `d = 1`. What I find a bit troubling is that this is done implicitly, as part of the `hashDistance` function. Without knowing that is what is happening, it is not clear to a new user - coming from other common LSH implementations - that `outputDim` is not the "number of hash functions" or "length of the hash signatures" but actually the "number of hash tables".

#### Transform semantics

In terms of `transform`, I disagree somewhat that the main use case is "dimensionality reduction". Perhaps there are common examples of using the hash signatures as a lower-dimensional feature representation in some model (in a similar way to, say, a PCA transform), but I haven't seen that. In my view, the real use case is approximate nearest neighbour search.

I'll give a concrete example for the `transform` output. Let's say I want to export recommendation model factor vectors (from ALS), or Word2Vec vectors, etc., to a real-time scoring system. I have many items, so I'd like to use LSH to make my scoring feasible. I do this by effectively doing a real-time version of OR-amplification: I store the hash tables (`L` tables of `d`-length hash signatures) alongside my vectors.
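To make the `L` tables / `d` hash functions terminology concrete, here is a minimal sketch of random-hyperplane (sign) LSH under that scheme. This is illustrative only, not Spark's API; all names (`LshSketch`, `makePlanes`, `signatures`) are hypothetical:

```scala
import scala.util.Random

// Illustrative sketch (not Spark's API): random-hyperplane LSH producing
// L hash tables, each signature being a concatenation of d hash bits.
object LshSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One hash function: the sign of the projection onto a random hyperplane.
  def hashBit(v: Vec, plane: Vec): Int = if (dot(v, plane) >= 0) 1 else 0

  // Draw L tables x d hash functions, each a random Gaussian direction.
  def makePlanes(l: Int, d: Int, dim: Int, seed: Long): Array[Array[Vec]] = {
    val rng = new Random(seed)
    Array.fill(l, d)(Array.fill(dim)(rng.nextGaussian()))
  }

  // Each input vector maps to L signatures, each of length d.
  def signatures(v: Vec, planes: Array[Array[Vec]]): Array[Array[Int]] =
    planes.map(table => table.map(plane => hashBit(v, plane)))
}
```

With `d = 1` (as in the current implementation), each of the `L` signatures collapses to a single hash bit, which is exactly the `1 x L` output discussed below.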
When doing "similar items" for a given item, I retrieve the hash signatures of the query item and use them to filter down the candidate item set for my scoring. (This is in fact something I'm working on in a demo project currently.) So if we will support the OR/AND combination, then it will be very important for `transform` to output the full `L x d` set of hash signatures.

#### Proposal

My recommendation is:

1. Future-proof the API by returning `Array[Vector]` from `transform` (as mentioned above by others).
2. Update the docs / user guide to make it really clear what the implementation is doing.
3. Make it clear that the implied `d` value here is `1`; we can mention that AND-amplification will be implemented later and perhaps even link to a JIRA.
4. Rename `outputDim` to something like `numHashTables`.
5. When we add AND-amplification, we can add a parameter such as `hashSignatureLength` or `numHashFunctions`.
6. Make as much private as possible to avoid being stuck with implementation details in future releases (e.g. I also don't see why `randUnitVectors` or `randCoefficients` needs to be public).

One issue I have is that currently we would output a `1 x L` set of hash values, when it should really be `L x 1`, i.e. a set of `L` signatures of length `1`. I guess we can leave it as is, but document what the output actually is.

I believe we should support OR/AND amplification in future. If so, then many things will need to change: `hashFunction`, `hashDistance`, etc. will need to be refactored. Most of the implementation is private/protected, so I think it will be ok - let's just ensure we're not left with an API that we can't change in future. Setting `L` and `d = 1` must then yield the same result as the current implementation to avoid a behavior change (I guess this will be ok, since the current default for `L` is `1`, and we can make the default for `d`, when added, also `1`).

Finally, my understanding was that results from some performance testing would be posted. I don't believe we've seen those yet.
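The OR-amplification candidate filtering described above can be sketched as follows. This is an illustrative sketch, not Spark's API; `OrAmplification`, `isCandidate`, and `candidates` are hypothetical names:

```scala
// Illustrative sketch (not Spark's API): OR-amplification candidate filtering.
// Two vectors are candidates if ANY of their L table signatures match exactly
// (each signature itself being an AND over its d hash bits).
object OrAmplification {
  def isCandidate(a: Array[Array[Int]], b: Array[Array[Int]]): Boolean =
    a.zip(b).exists { case (sigA, sigB) => sigA.sameElements(sigB) }

  // Filter a catalogue of (id, signatures) pairs down to candidates for a query.
  def candidates(query: Array[Array[Int]],
                 items: Seq[(String, Array[Array[Int]])]): Seq[String] =
    items.collect { case (id, sigs) if isCandidate(query, sigs) => id }
}
```

Only the surviving candidates are then scored exactly, which is what makes real-time nearest-neighbour search over many items feasible.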