Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902

+1 for using the DataFrame-based version of the code.

@zhengruifeng One thing I want to confirm: looking at your test code, both the RDD-based and the DataFrame-based versions pay the deserialization cost:

```
...
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()
...
// do `imputer.fit`
```

When `imputer.fit` runs, it extracts the required columns from the cached input DataFrame, and you then compare the performance of `RDD.aggregate` against the DataFrame `avg`. Both need to deserialize data from the cached input before computing, and the DataFrame `avg` additionally benefits from codegen, so it should be faster. But the margin by which the test shows the RDD version losing to the DataFrame version does not seem reasonable, so I want to confirm: in your RDD-based test, do you cache again after extracting the `RDD` from the input `Dataframe`? If not, there is no problem with your test, and I would guess there is some other performance issue in the SQL layer; cc @cloud-fan to take a look.
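To make the caching question concrete, here is a hypothetical sketch of the two benchmark paths under discussion (this is not the actual test code from the PR; the column name `value` and the session setup are assumptions). The key point is that `df.rdd` re-deserializes the cached rows on every action unless the extracted RDD is persisted separately, which would change what the comparison measures:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumed setup: a local SparkSession and a single numeric column "value".
val spark = SparkSession.builder().master("local[*]").appName("imputer-bench").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("value")
df.persist()
df.count()  // materialize the cache before timing anything

// DataFrame path: codegen-ed aggregation reading the cached data directly.
val dfAvg = df.select(avg("value")).first().getDouble(0)

// RDD path: .rdd deserializes cached rows into Row objects on every action
// unless the RDD itself is persisted as well.
val rdd = df.select("value").rdd.map(_.getDouble(0))
// rdd.persist()  // <- uncommenting this would change the comparison:
//                     the RDD side would then aggregate over its own cache.
val (sum, cnt) = rdd.aggregate((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1L),
  (a, b) => (a._1 + b._1, a._2 + b._2))
val rddAvg = sum / cnt

spark.stop()
```

If `rdd.persist()` is left out (as asked above), both paths read from the same cached DataFrame, so the comparison isolates the aggregation cost itself rather than caching differences.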