Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-55971460

To understand and evaluate this pull request, I would suggest that a reviewer do the following:

1) Look at the `PointOps` trait and its `FastEuclideanOps` implementation to understand its purpose.
2) Look at the `MultiKMeans` class that implements the iterations of Lloyd's algorithm. Confirm that it operates as you would expect.
3) Look at the `KMeansRandom` class. Confirm that it creates `runs` sets of `k` random cluster centers each.
4) Look at the `KMeansParallel` class. Confirm that it implements the K Means || algorithm and creates `runs` sets of at most `k` cluster centers.
5) Look at the `KmeansPlusPlus` class. Confirm that it implements the K Means ++ algorithm.

If the reviewer is familiar with the K Means, K Means ||, and K Means ++ algorithms, then I suspect that the code can be thoroughly reviewed in a couple of hours.
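For a reviewer who wants a reference point while checking steps 2 and 5, here is a minimal single-machine sketch of K Means ++ seeding and one Lloyd iteration. It is not the code in this pull request: the function names (`sq_dist`, `kmeans_pp_seed`, `lloyd_step`, `cost`) are hypothetical, it is written in plain Python rather than against Spark RDDs, and it does not attempt the distributed K Means || variant. It assumes points are tuples of numbers and that an empty cluster simply keeps its old center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, rng):
    """K Means ++ seeding (serial): the first center is uniform at random;
    each later center is drawn with probability proportional to its squared
    distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center, so fewer than
            # k distinct centers exist; return what we have.
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def lloyd_step(points, centers):
    """One Lloyd iteration: assign each point to its closest center,
    then move each center to the mean of the points assigned to it."""
    clusters = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[idx].append(p)
    new_centers = []
    for old, members in zip(centers, clusters):
        if members:
            dim = len(old)
            mean = tuple(sum(m[d] for m in members) / len(members)
                         for d in range(dim))
            new_centers.append(mean)
        else:
            # Assumption of this sketch: an empty cluster keeps its center.
            new_centers.append(old)
    return new_centers


def cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    seeds = kmeans_pp_seed(pts, 2, rng)
    print(cost(pts, seeds), cost(pts, lloyd_step(pts, seeds)))
```

Two properties are worth checking against the real implementation: seeding returns at most `k` centers, all taken from the input points, and a Lloyd iteration never increases the clustering cost.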