Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21498

@viirya sorry, I somehow missed your updated benchmark. Yes, it makes sense. In the case where no shuffle is needed after the union, we see about a 2% performance regression. I am not sure about the reliability of the tests with `sample`, as they may return a different number of rows on each run, IIUC. Can we remove the two sample operations and leave just the filter?

Moreover, I think it would also be interesting to understand how much time is spent in other steps, for instance in collecting. If the time to collect the data to the driver is very high, then the performance regression in the remaining part of the query would be much higher in percentage terms, because the collect time dilutes the measured difference. Honestly, though, I am not sure how to estimate it properly. Do you have any idea about this?

@cloud-fan @kiszk what do you think?
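
To make the point about collect time concrete, here is a minimal sketch of the arithmetic only, not a measurement. It assumes the collect time is unaffected by the change and makes up a fixed fraction of the baseline wall time. The function name and the collect fractions are hypothetical; only the 2% figure comes from the benchmark discussed above.

```python
def regression_excluding_collect(observed_regression, collect_fraction):
    """Estimate the regression of the non-collect part of a query.

    Assumption (hypothetical model): the baseline time T splits into a
    collect part f*T that does not change and a remaining part (1-f)*T
    that slows down by a factor (1 + r). The observed overall regression
    is then o = (1 - f) * r, so r = o / (1 - f).
    """
    if not 0.0 <= collect_fraction < 1.0:
        raise ValueError("collect_fraction must be in [0, 1)")
    return observed_regression / (1.0 - collect_fraction)


if __name__ == "__main__":
    observed = 0.02  # the ~2% regression reported for the no-shuffle case
    # Hypothetical collect fractions, purely for illustration.
    for fraction in (0.0, 0.5, 0.9):
        estimate = regression_excluding_collect(observed, fraction)
        print(f"collect fraction {fraction:.0%}: "
              f"regression outside collect ~ {estimate:.1%}")
```

Under this model, the larger the share of time spent collecting, the more the 2% figure understates the slowdown of the operators actually touched by the change. Estimating the real collect fraction would still require a separate measurement.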