[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14948 @yanboliang I'm going to close this PR and instead 'port' a few changes I made in this version for your consideration. Yours should be the primary PR for removing `runs` I think. I'll break out the two other changes in separate PRs. You're right that this is the question regarding init steps -- more steps could make for a better clustering, which could indirectly mean a faster convergence too. Maybe you can check my work. I think the default of 5 was taken from Table 6 in http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but it's not saying 5 is necessarily an optimal value. In fact Figure 5.2/5.3 imply that (for l/k=2 as we've chosen here) there's virtually no improvement for more than 2 init steps. Coupled with the fact that an init step now takes about 5x longer than a single iteration, it seems like 5 is pretty expensive as a default too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
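For readers unfamiliar with what an "init step" is here, the following is a minimal sketch of the oversampling rounds of k-means||, assuming the oversampling factor l = 2k mentioned above. It is not Spark's implementation: the function names are hypothetical, it runs on a plain Python list, and it omits the final step that weights the candidates and reduces them to k centers.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_parallel_candidates(points, k, init_steps=2, seed=0):
    """Run the oversampling rounds of k-means|| and return the candidate centers.

    Hypothetical helper for illustration only; the reduction of the
    candidates down to k centers is intentionally left out.
    """
    rng = random.Random(seed)
    ell = 2 * k  # assumption: oversampling factor l with l/k = 2
    centers = [rng.choice(points)]  # start from one random point
    for _ in range(init_steps):
        # Cost of each point: squared distance to its nearest candidate.
        costs = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a candidate
        # Sample each point independently with probability l * d^2 / total cost.
        # Each round is a full pass over the data, which is why extra
        # init steps are not free.
        sampled = [p for p, d in zip(points, costs)
                   if rng.random() < min(1.0, ell * d / total)]
        centers.extend(sampled)
    return centers
```

Each round adds roughly l new candidates in expectation, so more rounds give the final reduction step more to choose from, at the price of another pass over the data. The sketch does not deduplicate: identical input points can be sampled in the same round.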
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14948 Yep, these changes mostly do not overlap with #14937, and where they do overlap they are the same changes. Let's make this PR focus on optimizing the initialization step and on fixing the fact that it is likely to return duplicate centroids. As for reducing the default number of initialization steps for k-means|| to 2, I wonder about the impact on the total number of training iterations. It will definitely reduce the initialization time, but will it introduce more training iterations because the initial centers are not good enough? If it does not introduce extra iterations in most cases, I think it's OK. Otherwise we should trade off initialization time against training iterations. Thanks!
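The trade-off raised here can be put in rough numbers. The sketch below assumes the estimate given earlier in the thread that one init step costs about as much as 5 training iterations; the constant and function names are hypothetical, and the figures are back-of-the-envelope arithmetic, not measurements.

```python
# Assumption from the thread: one init step costs roughly 5 training iterations.
INIT_STEP_COST = 5


def total_cost(init_steps, iterations):
    """Total work in units of one training iteration."""
    return INIT_STEP_COST * init_steps + iterations


def break_even_extra_iterations(old_steps, new_steps):
    """Extra training iterations that would cancel out the init-time savings."""
    return INIT_STEP_COST * (old_steps - new_steps)


# Going from 5 init steps to 2 saves about 15 iteration-equivalents.
print(break_even_extra_iterations(5, 2))  # 15
```

Under that assumption, dropping the default from 5 to 2 only loses if the weaker initial centers cost more than about 15 additional training iterations.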
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14948 Ah yeah, it also removes runs from k-means||. The good news is I think these changes are actually not overlapping for the most part, and where they do, they're essentially the same change. @yanboliang what do you think of this? This takes out `runs` entirely and I think simplifies the code even a bit further. But the real win was reducing the default init steps. I'm also here trying to fix the fact that duplicate centroids can be returned.
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14948 I think https://github.com/apache/spark/pull/14937 also removes runs. cc @yanboliang can we coordinate these PRs?
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14948 Merged build finished. Test PASSed.
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14948 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64899/ Test PASSed.
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14948 **[Test build #64899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64899/consoleFull)** for PR 14948 at commit [`e7f12fa`](https://github.com/apache/spark/commit/e7f12fa3e1d3273f558f90455c6c5be8e6a9c8f6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14948 **[Test build #64899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64899/consoleFull)** for PR 14948 at commit [`e7f12fa`](https://github.com/apache/spark/commit/e7f12fa3e1d3273f558f90455c6c5be8e6a9c8f6).
[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14948 Note this also now resolves SPARK-3261. This change already means that with k-means|| init, fewer than k cluster centers may be returned, which is probably correct (and faster). Now random init will also return no duplicate centers, and thus < k clusters when the input has size < k.
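The behaviour described for random init can be illustrated with a small sketch: choose centers only among distinct points, so the result has fewer than k entries whenever the input has fewer than k distinct points. This is an illustration on a plain Python list, not the code in the PR, and `random_init_distinct` is a hypothetical name.

```python
import random


def random_init_distinct(points, k, seed=0):
    """Pick up to k distinct points as initial centers (hypothetical helper)."""
    # dict.fromkeys drops duplicate points while keeping first-seen order.
    distinct = list(dict.fromkeys(tuple(p) for p in points))
    if len(distinct) <= k:
        # Fewer than k distinct points: return them all, so < k centers.
        return distinct
    return random.Random(seed).sample(distinct, k)


# Three input points, but only two are distinct, so asking for 5 centers
# yields 2 rather than padding with duplicates.
centers = random_init_distinct([(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)], k=5)
print(len(centers))  # 2
```

Callers that assumed exactly k centers would need to handle the shorter result.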