Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148

A few high-level comments/questions:

* Should this go into the `feature` package as a feature estimator/transformer? That is where other dimensionality-reduction techniques have gone, and I'm not sure we should create a new package for this.
* Could you please point me to a specific section of a specific paper that documents the approaches used here? AFAICT, this patch implements something different from most of the approximate-nearest-neighbors-via-LSH algorithms found in papers. For instance, the method in section 2 [here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf), as well as the method on Wikipedia [here](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search), differs from the implementation in this PR. The spark package [`spark-neighbors`](https://github.com/sethah/spark-neighbors) also employs those approaches. I'm not an expert in LSH, so I was just hoping for some clarification.
* The implementation of the `RandomProjections` class actually follows the "2-stable" (or, more generically, "p-stable") LSH algorithm, and not the "Random Projection" algorithm in the paper that is referenced. At the very least, we should clarify this. Potentially, we should think of a better name.

@karlhigley Would you mind taking a look at the patch, or providing your input on the comments?
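For reference, the distinction the last bullet draws can be sketched as follows. This is a minimal, self-contained illustration of the two hash families being discussed, not code from this PR; the function names are mine, and the vector/RNG handling is deliberately simplified (plain Python lists, stdlib `random`):

```python
import math
import random

def make_sign_hash(dim, seed=0):
    """Sign-based random-projection hash (SimHash-style):
    h(x) = 1 if a . x >= 0 else 0, with a drawn i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0.0 else 0

def make_pstable_hash(dim, bucket_width, seed=0):
    """"2-stable" (p-stable, p=2) LSH bucket hash:
    h(x) = floor((a . x + b) / w), with a drawn i.i.d. N(0, 1)
    and offset b ~ Uniform[0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)
    def h(x):
        proj = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((proj + b) / bucket_width)
    return h

# The sign hash emits one bit per projection (angle-sensitive); the
# p-stable hash quantizes the projection into width-w buckets
# (Euclidean-distance-sensitive). In practice several such hashes are
# concatenated per table, and multiple tables are used.
h_sign = make_sign_hash(dim=3, seed=7)
h_bucket = make_pstable_hash(dim=3, bucket_width=4.0, seed=7)
print(h_sign([1.0, 2.0, 3.0]), h_bucket([1.0, 2.0, 3.0]))
```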