Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15148

@MLnick I agree with most of your comments. A few responses:

> In terms of transform - I disagree somewhat that the main use case is "dimensionality reduction". Perhaps there are common examples of using the hash signatures as a lower-dim representation as a feature in some model (e.g. in a similar way to say a PCA transform), but I haven't seen that.

This is very common in academic research and literature, but it may not be in industry. I'm fine with not considering it for now.

> I also don't see why randUnitVectors or randCoefficients needs to be public

You mentioned people using LSH outside of Spark for serving. In order to do that, we will need to expose randUnitVectors and randCoefficients so that users can compute hash values for query points. That said, I'm fine with making those private for now and preventing this use case for 1 release while we stabilize the API.

> One issue I have is that currently we would output a 1 x L set of hash values. But it actually should be L x 1 i.e. a set of signatures of length 1. I guess we can leave it as is, but document what the output actually is.

What about outputting a Matrix instead of an Array of Vectors? That will make it easy to change in the future, without us having weird Vectors of length 1.

> Finally, my understanding was results from some performance testing would be posted. I don't believe we've seen this yet.

You can see some results linked from the JIRA.
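
As a rough illustration of the serving use case above (hashing a query point outside of Spark once the random unit vectors are available), here is a minimal sketch. It assumes the usual bucketed random-projection form h(x) = floor(x . v / w); the function names, the `bucket_length` parameter, and the pure-Python setup are hypothetical and are not the API in this PR. The output is shaped as L rows of length 1, matching the "L x 1" reading discussed above.

```python
import math
import random


def rand_unit_vectors(num_tables, dim, seed=0):
    """Draw one random unit vector per hash table (hypothetical helper)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_tables):
        # Gaussian components, then normalize to unit length.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def hash_query(point, unit_vectors, bucket_length):
    """Hash one query point against every table.

    Returns L rows, each a signature of length 1 (an "L x 1" layout),
    assuming h(x) = floor(dot(x, v) / bucket_length).
    """
    signatures = []
    for v in unit_vectors:
        projection = sum(p * c for p, c in zip(point, v))
        signatures.append([math.floor(projection / bucket_length)])
    return signatures
```

For example, `hash_query(query, rand_unit_vectors(3, len(query)), 2.0)` would give three single-value signatures; a served query could then be matched against stored points that share a value in at least one table.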