GitHub user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    @viirya sorry, I somehow missed your updated benchmark. Yes, it makes sense.
In the case where no shuffle is needed after the union, we have about a 2%
performance regression. I am not sure how reliable the tests with `sample` are,
as they may return a different number of rows, IIUC. Can we remove the two
sample operations and leave just the filter?
    
    Moreover, I think it would also be interesting to understand how much time
is spent in collecting the result, for instance. If the time to collect the
data to the driver is very high, then the performance regression in the rest of
the query would be much higher in percentage terms. Honestly, though, I am not
sure how to estimate it properly. Do you have any idea about this?
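
    To make the point above concrete, here is a rough sketch of the arithmetic,
not a measurement. It assumes the run time splits into a fixed part that the
change does not affect (such as collecting to the driver) and a remaining part
that slows down uniformly. The function name and the fractions used below are
hypothetical; only the 2% figure comes from the benchmark discussed here.

```python
def regression_excluding_fixed_cost(observed_regression, fixed_fraction):
    """Estimate the slowdown of the part of the run the change affects.

    observed_regression: relative slowdown of the total run time (0.02 = 2%).
    fixed_fraction: assumed share of the baseline run time spent in work the
        change does not touch (e.g. collect), in the range [0, 1).

    Model: baseline T = C + W with C = fixed_fraction * T. After the change,
    T' = C + W * (1 + r), so (T' - T) / T = r * (1 - fixed_fraction).
    """
    if not 0.0 <= fixed_fraction < 1.0:
        raise ValueError("fixed_fraction must be in [0, 1)")
    return observed_regression / (1.0 - fixed_fraction)


if __name__ == "__main__":
    # Hypothetical shares of time spent in the unaffected part.
    for fraction in (0.0, 0.5, 0.9):
        r = regression_excluding_fixed_cost(0.02, fraction)
        print(f"fixed share {fraction:.0%}: affected part slows by {r:.1%}")
```

    Under this model, the larger the share of time spent in collect, the more
the observed 2% understates the slowdown of the code actually changed. The hard
part remains measuring that share reliably.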
    
    @cloud-fan @kiszk  what do you think?

