[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-30 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 merged to master --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or i

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67756/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67756/consoleFull)** for PR 15450 at commit [`f870fe9`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67756/consoleFull)** for PR 15450 at commit [`f870fe9`](https://github.com/apache/spark/commit/f

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67520/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67520/consoleFull)** for PR 15450 at commit [`79c84ad`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67520/consoleFull)** for PR 15450 at commit [`79c84ad`](https://github.com/apache/spark/commit/7

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 @sethah done. I also removed references to the runs parameter, which has no effect (and was triggering deprecation warnings). I should have done that last time.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67512/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67512/consoleFull)** for PR 15450 at commit [`d1004d9`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67512/consoleFull)** for PR 15450 at commit [`d1004d9`](https://github.com/apache/spark/commit/d

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-24 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 @sethah let me know how you feel about it at this stage

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67335 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67335/consoleFull)** for PR 15450 at commit [`793e4d5`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67335/consoleFull)** for PR 15450 at commit [`793e4d5`](https://github.com/apache/spark/commit/7

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-20 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15450 Also, if we're going to make this change, we should document in the ML estimator that the algorithm can return fewer than `k` centers.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67256/consoleFull)** for PR 15450 at commit [`ebebcb9`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67256/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67256 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67256/consoleFull)** for PR 15450 at commit [`ebebcb9`](https://github.com/apache/spark/commit/e

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67188/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67188/consoleFull)** for PR 15450 at commit [`85c9857`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67188/consoleFull)** for PR 15450 at commit [`85c9857`](https://github.com/apache/spark/commit/8

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-19 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 _k_ is a parameter to the model building process, and I don't think it should change based on the model that comes out. It's the requested or maximum number of centroids, if you like. Or, weigh that

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15450 I don't feel strongly either way, but I don't like the potential of this: `model.getK` returns 3 while `model.clusterCenters.length` returns 1. Should we conside
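
The mismatch described in this comment can be illustrated with a small sketch in plain Scala. It is not Spark code: `FittedModel`, `requestedK`, `actualK`, and `fit` are hypothetical names chosen for illustration, and the toy `fit` simply takes the first `k` distinct points as centers without running any iterations.

```scala
object KVersusCentersSketch {
  // Hypothetical stand-in for a fitted model: `requestedK` is what the caller
  // asked for, `clusterCenters` is what the fit actually produced.
  final case class FittedModel(requestedK: Int, clusterCenters: Seq[Vector[Double]]) {
    // Number of centers actually present in the model.
    def actualK: Int = clusterCenters.length
  }

  // Toy "fit" (an assumption, not the real algorithm): use the first k
  // distinct points as centers, so duplicates never become separate centers.
  def fit(points: Seq[Vector[Double]], k: Int): FittedModel =
    FittedModel(k, points.distinct.take(k))

  def main(args: Array[String]): Unit = {
    // Three identical points, three clusters requested.
    val points = Seq(Vector(1.0, 2.0), Vector(1.0, 2.0), Vector(1.0, 2.0))
    val model = fit(points, k = 3)
    println(model.requestedK) // prints 3
    println(model.actualK)    // prints 1
  }
}
```

Under these assumptions, code that needs the number of clusters in the result would read it from the centers rather than from the requested parameter.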

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-18 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 Heh, I believe the PC term is 'corner cases'. I agree. There's not much point in clustering data to k centroids when there are <= k distinct points. I think that's all the more reason to not make th

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15450 Aren't all these cases sort of nonsensical anyway? What good is performing clustering on a dataset where the result has (approximately) the same number of clusters as unique data points? Th

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-18 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 @sethah I should say I am not trying to handle cases where clusters start separate and converge to nearly the same point. I don't think that's something we should even try to do. To elaborate, he

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-17 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15450 The cases you enumerated are the ones I was thinking of. The changes introduced here would alleviate those problems, I agree. What I'm wondering is if this problem still exists in other cases. If Der

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-17 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 @sethah I agree that when there are lots of unique points (>> k) then this is almost certain to not happen, and that's most real-world use cases, but the question indeed is what should happen when th

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-16 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15450 @srowen I'm not against the change per se, I was just hoping to understand how duplicate centers arise. In the case of `initRandom` sampling with replacement makes it possible to select the same init
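
As a rough illustration of the point about sampling with replacement, the sketch below draws `k` initial centers with replacement from a small set of points and then drops duplicates. It is plain Scala rather than Spark code: `sampleWithReplacement` and `distinctCenters` are hypothetical helper names, and the `distinct` step is only one possible way to remove duplicate centers, not necessarily what this PR does.

```scala
import scala.util.Random

object DuplicateCentersSketch {
  type Point = Vector[Double]

  // Hypothetical helper: draw k points uniformly at random, with replacement,
  // so the same point can be picked more than once.
  def sampleWithReplacement(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] =
    Seq.fill(k)(points(rng.nextInt(points.length)))

  // Hypothetical helper: keep only the first occurrence of each center.
  def distinctCenters(centers: Seq[Point]): Seq[Point] = centers.distinct

  def main(args: Array[String]): Unit = {
    // Only two distinct points but k = 3 centers requested, so at least
    // one sampled center must be a repeat.
    val points = IndexedSeq(Vector(0.0, 0.0), Vector(1.0, 1.0))
    val k = 3
    val initial = sampleWithReplacement(points, k, new Random(42L))
    val deduped = distinctCenters(initial)
    println(s"requested k = $k, sampled = ${initial.length}, distinct = ${deduped.length}")
  }
}
```

With many more distinct points than `k`, repeats become unlikely, which matches the observation elsewhere in the thread that the issue mostly matters for small or highly duplicated inputs.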

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test PASSed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67009/consoleFull)** for PR 15450 at commit [`ab486c1`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67009/ Test PASSed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #67009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67009/consoleFull)** for PR 15450 at commit [`ab486c1`](https://github.com/apache/spark/commit/a

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15450 @sethah I wanted to check how strongly against this kind of change you might be, and continue the discussion here.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Merged build finished. Test FAILed.

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15450 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66816/ Test FAILed. ---

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #66816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66816/consoleFull)** for PR 15450 at commit [`42279b8`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15450 **[Test build #66816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66816/consoleFull)** for PR 15450 at commit [`42279b8`](https://github.com/apache/spark/commit/4