[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

2015-09-20 Thread HuJiayin
Github user HuJiayin closed the pull request at: https://github.com/apache/spark/pull/8546 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.

[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

2015-09-20 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-141855212 On the other hand, newCenters will cause a sudden increase in memory usage; even though clear is called immediately, I think it still waits for GC to actually reclaim the memory. newCenters will still …
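The buffer lifecycle being discussed can be illustrated with a minimal sketch (plain Python for illustration, not Spark's Scala code; the `new_centers` name mirrors the `newCenters` variable under discussion). The buffer grows sharply each round, and `clear()` only drops references; on the JVM the backing memory is reclaimed whenever the garbage collector next runs, which is the concern raised here.

```python
# Sketch of the per-round buffer lifecycle discussed above. Each round
# allocates a large batch of candidate centers, merges it into the
# accumulated list, then clears the buffer. clear() drops the
# references; on the JVM the memory itself is only freed once the
# garbage collector runs.
centers = []
for step in range(5):
    new_centers = [(step, j) for j in range(1000)]  # sudden growth each round
    centers.extend(new_centers)                     # merge into the result
    new_centers.clear()                             # references dropped here

assert len(centers) == 5000
assert len(new_centers) == 0
```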

[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

2015-09-17 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-141347838 @HuJiayin This basically reverts the behavior back to 1.2. The change we made in 1.3 was to avoid recomputing distances between old centers and input points during …
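The optimization mengxr refers to can be sketched roughly as follows (plain Python with illustrative names, not Spark's actual Scala implementation): each point caches its best cost so far, and each round compares only against the centers newly added in that round, instead of recomputing distances to every center chosen so far.

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_parallel_costs(points, centers_per_round):
    """Sketch of the incremental cost update: keep each point's best
    (minimum squared) distance so far and, per round, compare it only
    against that round's newly sampled centers."""
    costs = [float("inf")] * len(points)
    all_centers = []
    for new_centers in centers_per_round:
        # Only the centers added this round are examined.
        for i, p in enumerate(points):
            for c in new_centers:
                d = squared_dist(p, c)
                if d < costs[i]:
                    costs[i] = d
        all_centers.extend(new_centers)
    return costs, all_centers
```

This is the trade-off at the heart of the thread: caching per-point costs saves recomputation but holds extra state in memory across rounds.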

[GitHub] spark pull request: SPARK-10329 Cost RDD in k-means|| initializati...

2015-09-16 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-140959166 @mengxr I reduced the centers' storage and deleted the fallback and duplicate code. I tested the functionality and performance locally, and it works. Could you give me a …

[GitHub] spark pull request: SPARK-10329

2015-09-14 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-140283221 @HuJiayin Could you add a description to the PR title? The JIRA number doesn't describe the content. For the implementation, maybe we can avoid duplicating code. How …

[GitHub] spark pull request: SPARK-10329

2015-09-06 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-138047275 Users don't need to enlarge memory to run 1.5 k-means after applying this fix. They can still use the 1.2 configuration and have a stable run experience at the same time.

[GitHub] spark pull request: SPARK-10329

2015-09-01 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136908702 K-means|| is a better algorithm for finding the centers; if the user has sufficient memory, the performance is better. But sometimes, because of k-means parameters like k, …
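For context, the k-means|| initialization scheme works roughly as follows, in a hedged Python sketch; function and parameter names here are chosen for illustration and are not Spark's API.

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_parallel_init(points, k, rounds=5, oversample=None, seed=0):
    """Illustrative sketch of k-means|| initialization: in each round,
    every point becomes a candidate center with probability proportional
    to its squared distance to the nearest center chosen so far.
    `oversample` stands in for the oversampling factor (2*k by default here)."""
    rng = random.Random(seed)
    l = oversample if oversample is not None else 2 * k
    # Seed with one uniformly random point.
    centers = [points[rng.randrange(len(points))]]
    for _ in range(rounds):
        costs = [min(squared_dist(p, c) for c in centers) for p in points]
        total = sum(costs) or 1.0  # avoid division by zero
        new = [p for p, c in zip(points, costs)
               if rng.random() < min(1.0, l * c / total)]
        centers.extend(new)
    # A real implementation would then reduce these candidates to k final
    # centers, e.g. by running weighted k-means++ on the candidate set.
    return centers
```

Note that the candidate set can grow well beyond k (roughly `l` new candidates per round), which is where the memory pressure discussed in this thread comes from.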

[GitHub] spark pull request: SPARK-10329

2015-09-01 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136911191 Users can manually adjust memory when they hit a failure, but they may spend some time finding the root cause. There are many ways to implement k-means, and they work well …

[GitHub] spark pull request: SPARK-10329

2015-09-01 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136652392 I personally think it's brittle to arbitrarily change behavior based on one executor's memory size. It seems like this suggests you can only use the current …

[GitHub] spark pull request: SPARK-10329

2015-09-01 Thread srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/8546#discussion_r38401804 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -246,10 +248,23 @@ class KMeans private ( if …

[GitHub] spark pull request: SPARK-10329

2015-09-01 Thread srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/8546#discussion_r38401811 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -246,10 +248,23 @@ class KMeans private ( if …

[GitHub] spark pull request: SPARK-10329

2015-08-31 Thread HuJiayin
Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136566441 cc @mengxr

[GitHub] spark pull request: SPARK-10329

2015-08-31 Thread HuJiayin
GitHub user HuJiayin opened a pull request: https://github.com/apache/spark/pull/8546 SPARK-10329 K-means|| is better at finding centers more efficiently, based on stochastic processes. But some users with small memory will have difficulty running it. The patch will fall back to …

[GitHub] spark pull request: SPARK-10329

2015-08-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136561196 Can one of the admins verify this patch?