[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

sethah Wed, 02 Nov 2016 12:47:16 -0700

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/14937
  
    @yanboliang I ran the test. The master branch runs in 10 seconds and the 
current patch runs in 6 seconds. Still, in my opinion the results are 
meaningless on such a small dataset. I also ran both branches at a larger scale 
and saw that, in one case, the master branch takes ~20 seconds per iteration 
while this patch takes 10 minutes. I traced the slowdown to the way the data is 
being copied. Could you also run tests at scale to verify this?
    
    Again, with some refactoring I ran some very preliminary tests (data size 
approximately 100 GB, with 100 to 1k clusters) and saw that this branch improves 
performance in some cases and degrades it in others. I think we need to test 
this at scale to really understand the implications. I will try to summarize my 
results sometime in the next week. I think we will see performance gains when 
the number of features/clusters is large.



