Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14956#discussion_r78277099
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
    @@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
         val points = v.mapValues(x => Vectors.dense(x)).cache()
         val model = new KMeans()
           .setK(k)
    -      .setSeed(0L)
    +      .setSeed(5L)
    --- End diff --
    
    Sorry Matei I kind of missed your point. Yes it's a bit more strange to be 
changing a seed in a non-test file. Same reasoning, but I agree. There's a seed 
here to begin with for determinism but probably shouldn't matter.
    
    I think I understand the problem, and I think it's the test. k-means is 
used to cluster 1D data. The test case generates two concentric circles of 
points of radius 1 and 4, which are intended to form k=2 separate clusters in 
the derived values that are clustered by k-means internally. That's even clear 
from looking at the similarities plotted:
    
    
![rplot](https://cloud.githubusercontent.com/assets/822522/18410522/2de05f76-775d-11e6-8e75-07d3d31e5cae.png)
    
    While it's clear what the clustering is supposed to be, it's not actually 
the lowest-cost k-means clustering. Many clusterings do find the 'wrong' better 
clustering which is one that would include a few of the leftmost elements of 
the right group into the left one. Many other clusterings get the 'right' 
answer which is a big local minimum but not optimal. In fact, k-means|| init 
seems to do worse than random here exactly because it's less likely to find the 
local minimum.
    
    I don't think the choice of radii matters here, since the resulting values 
above are basically invariant.
    
    I'm going to have to read the paper more to understand what the difference 
is here. It's not quite the k-means change here, and, we can make this test 
pass easily by either
    
    - Set seed back to 0 and fix init steps = 5 for this use of k-means, 
because that happens to work. Then this implementation doesn't change at all. 
It means it does more work just to make the test pass.
    - Set seed to, say, 5 to get this to pass, on the theory that the choice of 
seed still doesn't seem to matter per se, and 5 is no worse than 0.
    
    Obviously I want to understand a little more about how this is ever 
supposed to work in PIC, though it ends up being a slightly different issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to