[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14948
  
@yanboliang I'm going to close this PR and instead 'port' a few changes I 
made in this version for your consideration. Yours should be the primary PR for 
removing `runs` I think. I'll break out the two other changes in separate PRs.

You're right that this is the question regarding init steps -- more steps 
could make for a better clustering, which could indirectly mean a faster 
convergence too. Maybe you can check my work. I think the default of 5 was 
taken from Table 6 in 
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but the paper doesn't 
say 5 is necessarily an optimal value. In fact, Figures 5.2/5.3 imply that (for 
l/k=2, as we've chosen here) there's virtually no improvement for more than 2 
init steps.
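
For readers unfamiliar with the init steps being discussed, here is a minimal sketch of the k-means|| oversampling rounds, assuming an oversampling factor l = 2k as mentioned above. It is plain Python for illustration only, not Spark's implementation; the function and parameter names are hypothetical, and the final step that reduces the candidates to k centers is omitted.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_parallel_candidates(points, k, init_steps=2, seed=0):
    """Collect candidate centers over `init_steps` oversampling rounds.

    Assumption: oversampling factor l = 2 * k (so l/k = 2).
    """
    rng = random.Random(seed)
    ell = 2 * k
    centers = [rng.choice(points)]  # start from one point chosen uniformly
    for _ in range(init_steps):
        # Cost of each point: squared distance to its nearest current center.
        costs = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a center
        # Keep each point independently with probability min(1, l * cost / total).
        picked = [p for p, c in zip(points, costs)
                  if rng.random() < min(1.0, ell * c / total)]
        centers.extend(picked)
    return centers

points = [(float(i), float(i % 7)) for i in range(50)]
print(len(kmeans_parallel_candidates(points, k=3, init_steps=2)))
print(len(kmeans_parallel_candidates(points, k=3, init_steps=5)))
```

Each round makes a full pass over the data and adds at most about l candidates in expectation, so the number of init steps directly controls both the size of the candidate set and the cost of initialization.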

Coupled with the fact that an init step now takes about 5x longer than a 
single iteration, it seems like 5 is pretty expensive as a default too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-04 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14948
  
Yep, these changes do not overlap with #14937 for the most part, and 
where they do overlap they are actually the same changes. Let's make this PR 
focus on optimizing the initialization step and on fixing its tendency to 
return duplicate centroids.

As for reducing the default initialization steps to 2 for k-means||, I wonder 
about the impact on the total number of training iterations. It will definitely 
reduce the initialization time, but will it introduce more training iterations 
because the initial centers are not good enough? If it does not introduce extra 
iterations in most cases, I think it's OK. Otherwise we should trade off 
initialization against training iterations. Thanks!
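
The trade-off can be put in rough numbers. The sketch below assumes the figure quoted elsewhere in this thread, that one init step costs about 5x a single training iteration; that ratio is an estimate, and the function names are hypothetical.

```python
def total_cost(init_steps, iterations, step_cost=5.0):
    """Total training cost, measured in units of one training iteration.

    Assumption: one init step costs `step_cost` iterations (about 5x here).
    """
    return init_steps * step_cost + iterations

def break_even_extra_iterations(old_steps=5, new_steps=2, step_cost=5.0):
    """Extra training iterations that would cancel the init-time saving."""
    return (old_steps - new_steps) * step_cost

print(break_even_extra_iterations())  # 15.0
```

Under that assumption, dropping from 5 to 2 init steps saves the equivalent of 15 iterations, so the change only loses if the weaker initial centers cost more than about 15 additional training iterations.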





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14948
  
Ah yeah, it also removes runs from k-means||. The good news is I think 
these changes are actually not overlapping for the most part, and where they 
do, they're essentially the same change.

@yanboliang what do you think of this? This takes out `runs` entirely and I 
think simplifies the code even a bit further. But the real win was reducing the 
default init steps. I'm also trying here to fix the fact that duplicate 
centroids can be returned.





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14948
  
I think https://github.com/apache/spark/pull/14937 also removes runs. cc 
@yanboliang can we coordinate these PRs? 





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14948
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14948
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64899/





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14948
  
**[Test build #64899 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64899/consoleFull)**
 for PR 14948 at commit 
[`e7f12fa`](https://github.com/apache/spark/commit/e7f12fa3e1d3273f558f90455c6c5be8e6a9c8f6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14948
  
**[Test build #64899 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64899/consoleFull)**
 for PR 14948 at commit 
[`e7f12fa`](https://github.com/apache/spark/commit/e7f12fa3e1d3273f558f90455c6c5be8e6a9c8f6).





[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

2016-09-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14948
  
Note this also now resolves SPARK-3261. This change already means that with 
k-means|| init, fewer than k cluster centers may be returned, which is probably 
correct (and faster). Now random init will also return no duplicate centers, 
and thus < k clusters when the input has size < k.
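
To illustrate the behavior described here, a minimal sketch of random initialization that never returns duplicate centers. This is plain Python under the assumption that points are hashable tuples; it is not the code in this PR, and the function name is hypothetical.

```python
import random

def random_distinct_centers(points, k, seed=0):
    """Pick up to k distinct points uniformly at random as initial centers."""
    rng = random.Random(seed)
    # Drop duplicate points while preserving first-seen order.
    distinct = list(dict.fromkeys(points))
    # Fewer than k centers come back when there are fewer than k distinct points.
    return rng.sample(distinct, min(k, len(distinct)))

points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
print(len(random_distinct_centers(points, k=5)))  # 2
```

Sampling without replacement from the deduplicated points is what rules out duplicates; the caller then has to accept a result with fewer than k centers.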

