Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14956#discussion_r78277526
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
    @@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
         val points = v.mapValues(x => Vectors.dense(x)).cache()
         val model = new KMeans()
           .setK(k)
    -      .setSeed(0L)
    +      .setSeed(5L)
    --- End diff ---
    
    I got the tests to pass reliably by simply making the two sets of points 
generated in this test contain 10 points each, rather than 10 and 40. Balancing 
them made the issue go away.
    
    As to why the paper 'works', I'm actually not clear that it does. It does not 
simply k-means cluster the values. The authors say they run 100 clusterings and 
take the most common cluster assignment. It's a little ambiguous what this 
means, but it may be the source of the difference. AFAICT the current PIC test 
presents a situation that PIC clustering will often get wrong if it uses 
straight k-means internally.
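
    For what it's worth, one literal reading of "run 100 clusterings and take the 
most common assignment" could be sketched like this (a hypothetical helper in 
plain Scala, not anything in Spark's PIC code, and glossing over the fact that 
labels from independent k-means runs would need to be aligned before voting, 
since "cluster 0" in one run need not match "cluster 0" in another):

```scala
// Hypothetical sketch: majority vote over repeated clustering runs.
// Each element of `runs` holds one run's per-point cluster labels.
def consensusAssignments(runs: Seq[Array[Int]]): Array[Int] = {
  val numPoints = runs.head.length
  Array.tabulate(numPoints) { i =>
    runs.map(_(i))                       // point i's label in each run
      .groupBy(identity)
      .maxBy { case (_, votes) => votes.size }
      ._1                                // most frequent label wins
  }
}

// Three runs over three points; point 2 flips in one run.
val runs = Seq(Array(0, 1, 1), Array(0, 1, 0), Array(0, 1, 1))
consensusAssignments(runs)               // Array(0, 1, 1)
```

If that is what the paper means, the voting step could mask the sensitivity to 
any single run's initialization, which a single seeded k-means call cannot.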

