Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14956#discussion_r78277526

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
    @@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
         val points = v.mapValues(x => Vectors.dense(x)).cache()
         val model = new KMeans()
           .setK(k)
    -      .setSeed(0L)
    +      .setSeed(5L)
    --- End diff --

    I got the tests to pass reliably by simply making the two sets of points generated in this test both contain 10 points, not 10 and 40. Balancing them made the issue go away.

    As to why the paper 'works', I'm actually not clear that it does. It does not simply k-means-cluster the values: the authors say they run 100 clusterings and take the most common cluster assignment. It's a little ambiguous what this means, but it may be the source of the difference. As far as I can tell, the current PIC test does present a situation that PIC clustering will often get wrong if it uses straight k-means internally.
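One plausible reading of "run 100 clusterings and take the most common cluster assignment" is consensus clustering: since k-means labels permute arbitrarily across runs, vote on whether each *pair* of points lands in the same cluster, then keep pairs that co-cluster in a majority of runs. The sketch below illustrates that idea on imbalanced 1-D toy data shaped like the test (10 points vs. 40 points). This is only an illustration of the interpretation, not the paper's or Spark's actual implementation; `kmeans_1d` and `consensus_clusters` are hypothetical helpers.

```python
# Hypothetical sketch of a majority-vote ("consensus") clustering, one
# possible reading of the PIC paper's "run 100 clusterings and take the
# most common cluster assignment". NOT Spark's implementation.
import random

def kmeans_1d(xs, k=2, iters=20, rng=None):
    """Plain Lloyd's k-means on 1-D data; returns one label per point."""
    rng = rng or random
    centers = rng.sample(xs, k)  # init centers from distinct data points
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def consensus_clusters(xs, runs=100, k=2, seed=0):
    """Run k-means many times; join points that co-cluster in a majority
    of runs (connected components of the majority co-association graph)."""
    rng = random.Random(seed)
    n = len(xs)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_1d(xs, k=k, rng=rng)
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Union-find over pairs that co-clustered in more than half the runs.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if together[i][j] > runs // 2:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Imbalanced toy data like the test: 10 points near 0, 40 points near 10.
points = [i * 0.1 for i in range(10)] + [10.0 + i * 0.1 for i in range(40)]
groups = consensus_clusters(points)
```

On well-separated data like this, the vote recovers the two groups regardless of initialization; the open question in the review is how such a vote behaves on the much closer values PIC actually produces.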