Github user HuJiayin commented on the pull request:

    https://github.com/apache/spark/pull/8546#issuecomment-136908702
  
    K-means|| is a better algorithm for finding the initial centers: if the 
user has sufficient memory, it performs better. But depending on the k-means 
parameters (k, the number of samples) and the use case, the algorithm can 
induce a lot of duplicated computation and heavy memory consumption on each node. 
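    To make the memory/computation trade-off concrete, here is a minimal, single-machine toy sketch of the k-means|| initialization idea (oversample candidates over several rounds, then reduce the weighted candidates back to k). This is an illustrative sketch, not Spark's implementation: the function name, the oversampling factor `l = 2k`, the fixed round count, and the "keep the k heaviest candidates" reduction step are all simplifications of the real algorithm, which reclusters the weighted candidates with k-means++.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_parallel_init(points, k, l=None, rounds=5, seed=0):
    """Toy sketch of k-means|| (scalable k-means++) initialization."""
    rng = random.Random(seed)
    if l is None:
        l = 2 * k  # oversampling factor (illustrative default)
    centers = [rng.choice(points)]
    for _ in range(rounds):
        # Full pass: cost of every point to its nearest candidate center.
        # The growing candidate set is what drives up per-node memory and
        # duplicated computation when k and the sample count are large.
        costs = [min(dist2(p, c) for c in centers) for p in points]
        phi = sum(costs)
        if phi == 0:
            break
        # Sample each point independently with probability ~ l * cost / phi.
        centers.extend(p for p, c in zip(points, costs)
                       if rng.random() < l * c / phi)
    # Weight each candidate by how many points it serves, keep the k
    # heaviest. (The real algorithm reclusters the weighted candidates
    # with k-means++ instead of simply truncating.)
    weights = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
        weights[nearest] += 1
    ranked = sorted(range(len(centers)), key=lambda i: -weights[i])
    return [centers[i] for i in ranked[:k]]
```

    Note how each round keeps all candidate centers live while recomputing every point's cost against them, which is where the per-node memory pressure described above comes from.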
    
    I tried 8 GB of executor memory on each node with 10 slaves in YARN mode 
(dimension 20, samples = 1.2 billion in both runs; k = 5 and then k = 50000). 
With k-means|| the task failed with lost RDDs after the RDD storage was 
reduced, but the old method could still produce a result. When I enlarged the 
executor memory to 20 GB, k-means|| achieved the same performance as the old 
method.  
    
    The difference between k-means|| and the old method is not significant if 
the user data is small and a small executor memory is set. If the user sets a 
small executor memory on each node and has a large number of VMs or nodes, I 
think the networking cost will rise. It does not seem very practical. 
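    For reference, switching between the two initialization strategies being compared above can be done through the MLlib API. The snippet below is a sketch only (it assumes a running Spark cluster, a hypothetical HDFS input path, and that the "old method" refers to plain random initialization); it is not runnable standalone.

```python
# Sketch: requires a Spark cluster; input path is hypothetical.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-init-compare")
vectors = (sc.textFile("hdfs:///path/to/vectors")
             .map(lambda line: [float(x) for x in line.split()]))

# k-means|| initialization (MLlib's default)
model_parallel = KMeans.train(vectors, k=50000, maxIterations=10,
                              initializationMode="k-means||")

# "Old method": plain random initialization
model_random = KMeans.train(vectors, k=50000, maxIterations=10,
                            initializationMode="random")
```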
     


