Github user HuJiayin commented on the pull request: https://github.com/apache/spark/pull/8546#issuecomment-136908702 K-means|| is a better algorithm for finding the initial centers: if the user has sufficient memory, it performs better. But sometimes, depending on parameters such as k, the number of samples, and the use case, the algorithm induces a lot of duplicate computation and large memory consumption on each node. I tried 8G of executor memory per node with 10 slaves in YARN mode (dimension 20; k = 5 and then k = 50000; samples = 1.2 billion in both cases). With k-means||, the task failed with lost RDDs after the RDD storage was reduced, but the old method could still produce a result. When I enlarged the executor memory to 20G, k-means|| achieved the same performance as the old method. The difference between k-means|| and the old method is not significant if the user's data is small and the executor memory is set small. And if the user sets a small executor memory on each node while running a large number of VMs or nodes, I think the networking cost will rise. It does not seem very practical.
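For context, the two initialization strategies being compared can be selected through the MLlib `KMeans` builder. A minimal sketch follows, assuming `data` is an already-loaded `RDD[Vector]` of 20-dimensional points and that "the old method" refers to plain random initialization (`KMeans.RANDOM`); the iteration budget shown is illustrative, not from the benchmark above.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// data: RDD[Vector] of 20-dimensional points, assumed cached beforehand
def trainWithRandomInit(data: RDD[Vector]) =
  new KMeans()
    .setK(50000)                                // large k is what stresses k-means||
    .setMaxIterations(20)                       // hypothetical iteration budget
    .setInitializationMode(KMeans.RANDOM)       // "random": the old, low-memory method
    .run(data)

def trainWithKMeansParallel(data: RDD[Vector]) =
  new KMeans()
    .setK(50000)
    .setMaxIterations(20)
    .setInitializationMode(KMeans.K_MEANS_PARALLEL) // "k-means||" (the default)
    .setInitializationSteps(2)                  // fewer init steps can reduce memory pressure
    .run(data)
```

Switching between the two modes, and reducing `initializationSteps` for k-means||, is one way to trade initialization quality against per-node memory when executors are as constrained as in the 8G-per-node run described above.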