Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    +1 for using the DataFrame-based version of the code.
    
    @zhengruifeng One thing I want to confirm: I checked your testing code, and both the RDD-based version and the DataFrame-based version pay the deserialization cost:
    ```
    ...
    val df = spark.createDataFrame(rows, struct)
    df.persist()
    df.count()
    ...
    // do `imputer.fit`
    ```
    When running `imputer.fit`, it extracts the required columns from the cached input DataFrame, and then you compare the performance of `RDD.aggregate` against the DataFrame `avg`. Both need to deserialize the data from the cached input before computing, and the DataFrame `avg` should additionally benefit from codegen and so be faster. But the test here shows the RDD version is slower than the DataFrame version by a margin that does not seem reasonable, so I want to confirm:
    in your RDD-version test, do you cache again after getting the `RDD` from the input `DataFrame`?
    If not, then your test has no problem, and I would guess there is some other performance issue in the SQL layer; cc @cloud-fan to take a look.
    


