[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

thunterdb Tue, 01 Aug 2017 15:02:12 -0700

Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/18798
  
    Thank you for the performance numbers, @WeichenXu123. I have a couple of 
comments:
     - you say that SQL uses adaptive compaction. How bad is that? I assume it 
adds some overhead.
     - did you just run each experiment once? I would be interested in error 
bars on these numbers, as it can take up to 30 seconds for the JVM to warm up 
and optimize the bytecode. You should report the geometric mean or the median 
time of running these experiments to make sure that the results are not skewed 
by outliers. Others will probably have some good advice as well.
     - from the performance numbers, there are 2 different regimes: small 
vectors, and big vectors (for which even the DataFrame -> RDD conversion is 
faster than working straight with DataFrames). I would be curious to know the 
bottlenecks for each case.
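
    To make the second point concrete, here is a minimal sketch of the kind of 
timing loop I have in mind. It is written in plain Python purely for 
illustration and is not taken from this PR; the names `benchmark` and 
`summarize` and the default warm-up and run counts are assumptions. On the JVM 
a dedicated harness such as JMH would be the more usual choice.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=20):
    """Call fn() `warmup` times untimed, then return `runs` timings in seconds."""
    for _ in range(warmup):
        fn()  # discarded: gives the runtime time to warm up before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return statistics that are less sensitive to outliers than the mean.

    Assumes at least two strictly positive timings.
    """
    quartiles = statistics.quantiles(samples, n=4)
    return {
        "median": statistics.median(samples),
        "geometric_mean": statistics.geometric_mean(samples),
        "iqr": quartiles[2] - quartiles[0],  # spread, usable as an error bar
        "min": min(samples),
        "max": max(samples),
    }
```

    For example, `summarize(benchmark(run_experiment))` with a hypothetical 
`run_experiment` callable would give the median and geometric mean along with 
the interquartile range, so a single slow run does not dominate the reported 
number.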
    
    If we trust these numbers, the overall conclusion is that the SQL interface 
adds a 2x-3x performance overhead over RDDs for the time being. @cloud-fan 
@liancheng is there still some low-hanging fruit that could be merged into 
SQL? 
    
    This state of affairs is of course far from great, but I am in favor of 
merging this piece and improving it iteratively with the help of the SQL team. 
This code is easy to benchmark, and it is representative of what the rest of 
MLlib will look like once we start to rely more on DataFrames and Catalyst, and 
less on RDDs.
    
    @yanboliang @viirya @kiszk what are your thoughts?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
