Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14956#discussion_r78277099 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging { val points = v.mapValues(x => Vectors.dense(x)).cache() val model = new KMeans() .setK(k) - .setSeed(0L) + .setSeed(5L) --- End diff -- Sorry Matei I kind of missed your point. Yes it's a bit more strange to be changing a seed in a non-test file. Same reasoning, but I agree. There's a seed here to begin with for determinism but probably shouldn't matter. I think I understand the problem, and I think it's the test. k-means is used to cluster 1D data. The test case generates two concentric circles of points of radius 1 and 4, which are intended to form k=2 separate clusters in the derived values that are clustered by k-means internally. That's even clear from looking at the similarities plotted: ![rplot](https://cloud.githubusercontent.com/assets/822522/18410522/2de05f76-775d-11e6-8e75-07d3d31e5cae.png) While it's clear what the clustering is supposed to be, it's not actually the lowest-cost k-means clustering. Many clusterings do find the 'wrong' better clustering which is one that would include a few of the leftmost elements of the right group into the left one. Many other clusterings get the 'right' answer which is a big local minimum but not optimal. In fact, k-means|| init seems to do worse than random here exactly because it's less likely to find the local minimum. I don't think the choice of radii matters here, since the resulting values above are basically invariant. I'm going to have to read the paper more to understand what the difference is here. It's not quite the k-means change here, and, we can make this test pass easily by either - Set seed back to 0 and fix init steps = 5 for this use of k-means, because that happens to work. Then this implementation doesn't change at all. It means it does more work just to make the test pass. - Set seed to, say, 5 to get this to pass, on the theory that the choice of seed still doesn't seem to matter per se, and 5 is no worse than 0. Obviously I want to understand a little more about how this is ever supposed to work in PIC, though it ends up being a slightly different issue.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org