Github user viirya commented on the issue: https://github.com/apache/spark/pull/21109

Thanks for working on this. Based on the description on JIRA, I think the main cause of the bad performance is re-calculating an expensive function on matching rows. With the added benchmark, I adjusted the order of the conditions so that the expensive UDF is put at the end of the predicate. Below are the results. The first one is the original benchmark; the second is the one with the UDF at the end of the predicate.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off        6913 / 6964          0.0     1080112.4       1.0X
sort merge inner range join wholestage on         2094 / 2224          0.0      327217.4       3.3X
```

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off          675 /  704          0.0      105493.9       1.0X
sort merge inner range join wholestage on           374 /  398          0.0       58359.6       1.8X
```

Performance can be easily improved this way because of short-circuit evaluation of the predicate, and this applies to conditions other than just range comparisons. So I'm wondering whether we need a way to give Spark a hint to adjust the order of expressions when one of them is expensive, like a UDF.
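To illustrate what the reordering buys, here is a minimal sketch (not the benchmark code from this PR; the data, the `expensiveUdf` body, and the range bounds are made up). The two joins below differ only in where the UDF sits in the conjunction. Assuming the written order of the conjuncts is preserved, the second form lets the cheap range checks short-circuit the `And`, so the UDF only runs on row pairs that already pass the range condition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PredicateOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("predicate-order-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for an expensive UDF; the benchmark's real UDF differs.
    val expensiveUdf = udf { x: Long =>
      (1 to 10000).foldLeft(x)((acc, i) => acc ^ ((acc >> 1) + i)) % 7 == 0
    }

    val left  = spark.range(0, 10000).toDF("a")
    val right = spark.range(0, 10000).toDF("b")

    // UDF first: it is evaluated for every candidate row pair.
    val udfFirst = left.join(right,
      expensiveUdf($"a") && $"a" >= $"b" - 5 && $"a" <= $"b" + 5)

    // UDF last: short-circuit evaluation of the conjunction means the UDF
    // only runs for pairs that already satisfy the cheap range checks.
    val udfLast = left.join(right,
      $"a" >= $"b" - 5 && $"a" <= $"b" + 5 && expensiveUdf($"a"))

    udfFirst.count()
    udfLast.count()
    spark.stop()
  }
}
```

The sketch only shows what a user can do manually today by rewriting the condition; whether Spark should do this reordering automatically, or via a hint when it knows a conjunct is expensive, is the open question above.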