Github user viirya commented on the issue: https://github.com/apache/spark/pull/21109

Thanks for working on this. Based on the description on JIRA, I think the main cause of the bad performance is re-calculating an expensive function on matching rows. With the added benchmark, I adjusted the order of the conditions so that the expensive UDF is put at the end of the predicate. Below are the results. The first one is the original benchmark; the second is the one with the UDF at the end of the predicate.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off        6913 / 6964          0.0     1080112.4       1.0X
sort merge inner range join wholestage on         2094 / 2224          0.0      327217.4       3.3X
```

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off          675 /  704          0.0      105493.9       1.0X
sort merge inner range join wholestage on           374 /  398          0.0       58359.6       1.8X
```

Performance can be easily improved this way because of short-circuit evaluation of the predicate, and this applies to conditions other than just range comparisons. So I'm wondering whether we need a way to give Spark a hint to adjust the order of expressions when one of them is expensive, like a UDF.
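To illustrate what the reordering buys, here is a minimal sketch (not the benchmark code from this PR; the data, the `expensiveUdf` body, and the range bounds are made up). The two joins below differ only in where the UDF sits in the conjunction. Assuming the written order of the conjuncts is preserved, the second form lets the cheap range checks short-circuit the `And`, so the UDF only runs on row pairs that already pass the range condition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PredicateOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("predicate-order-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for an expensive UDF; the benchmark's real UDF differs.
    val expensiveUdf = udf { x: Long =>
      (1 to 10000).foldLeft(x)((acc, i) => acc ^ ((acc >> 1) + i)) % 7 == 0
    }

    val left  = spark.range(0, 10000).toDF("a")
    val right = spark.range(0, 10000).toDF("b")

    // UDF first: it is evaluated for every candidate row pair.
    val udfFirst = left.join(right,
      expensiveUdf($"a") && $"a" >= $"b" - 5 && $"a" <= $"b" + 5)

    // UDF last: short-circuit evaluation of the conjunction means the UDF
    // only runs for pairs that already satisfy the cheap range checks.
    val udfLast = left.join(right,
      $"a" >= $"b" - 5 && $"a" <= $"b" + 5 && expensiveUdf($"a"))

    udfFirst.count()
    udfLast.count()
    spark.stop()
  }
}
```

The sketch only shows what a user can do manually today by rewriting the condition; whether Spark should do this reordering automatically, or via a hint when it knows a conjunct is expensive, is the open question above.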