Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/18798

Thank you for the performance numbers @WeichenXu123. I have a couple of comments:

- You say that SQL uses adaptive compaction. How bad is that? I assume it adds some overhead.
- Did you run each experiment only once? I would be interested in error bars on these numbers, as it can take up to 30 seconds for the JVM to warm up and optimize the bytecode. You should report the geometric mean or the median time over repeated runs to make sure that the results are not skewed by outliers. Others will probably have some good advice as well.
- From the performance numbers, there are 2 different regimes: small vectors, and big vectors (for which even the DataFrame -> RDD conversion is faster than working directly with DataFrames). I would be curious to know the bottlenecks in each case.

If we trust these numbers, the overall conclusion is that the SQL interface adds a 2x-3x performance overhead over RDDs for the time being. @cloud-fan @liancheng is there still some low-hanging fruit that could be merged into SQL?

This state of affairs is of course far from great, but I am in favor of merging this piece and improving it iteratively with the help of the SQL team, as this code is easy to benchmark and is representative of the rest of MLlib once we start to rely more on DataFrames and Catalyst, and less on RDDs.

@yanboliang @viirya @kiszk what are your thoughts?
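To make the measurement suggestion concrete (discard warm-up runs, repeat the experiment, then report the median or geometric mean with some spread), here is a minimal sketch. It is written in Python purely for illustration and is not the benchmark code from this PR; the names `benchmark` and `summarize` and the default run counts are hypothetical, and a real JVM benchmark would need enough warm-up iterations to cover JIT compilation.

```python
import math
import statistics
import time


def summarize(samples):
    """Summarize timings (in seconds) with outlier-robust statistics."""
    if len(samples) < 2:
        raise ValueError("need at least two timed runs to report a spread")
    if any(s <= 0 for s in samples):
        raise ValueError("timings must be positive to take a geometric mean")
    # Geometric mean computed via logs to avoid overflow on long sample lists.
    geo_mean = math.exp(sum(math.log(s) for s in samples) / len(samples))
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return {
        "runs": len(samples),
        "median": statistics.median(samples),
        "geometric_mean": geo_mean,
        "q1": q1,  # lower quartile, usable as a lower error bar
        "q3": q3,  # upper quartile, usable as an upper error bar
        "min": min(samples),
        "max": max(samples),
    }


def benchmark(fn, warmup_runs=5, timed_runs=20):
    """Call fn repeatedly, ignore the warm-up runs, and summarize the rest."""
    for _ in range(warmup_runs):
        fn()  # result and timing deliberately discarded
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        # Clamp to 1 ns so a coarse timer cannot produce a zero sample.
        samples.append(max(elapsed, 1e-9))
    return summarize(samples)


if __name__ == "__main__":
    # Stand-in workload (an assumption); replace with the code under test.
    print(benchmark(lambda: sum(range(100_000))))
```

Reporting the median alongside the quartiles keeps a single slow run (for example, a GC pause) from dominating the number, which an arithmetic mean over a handful of runs would not.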