[GitHub] spark pull request #22364: [SPARK-25379][SQL] Improve AttributeSet and Colum...

mgaido91 Sat, 08 Sep 2018 02:32:23 -0700

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22364


    [SPARK-25379][SQL] Improve AttributeSet and ColumnPruning performance

    ## What changes were proposed in this pull request?
    
    This PR contains 3 optimizations:
     1)  It significantly improves the `--` operation on `AttributeSet`. The 
following code was run as a benchmark for the `--` operation:
    ```
    test("AttributeSet -- benchmark") {
      // A: 100 attributes; B: a subset of A; C: disjoint from A; D, E: half A, half C.
      val attrSetA = AttributeSet((1 to 100).map { i =>
        AttributeReference(s"c$i", IntegerType)() })
      val attrSetB = AttributeSet(attrSetA.take(80).toSeq)
      val attrSetC = AttributeSet((1 to 100).map { i =>
        AttributeReference(s"c2_$i", IntegerType)() })
      val attrSetD = AttributeSet((attrSetA.take(50) ++ attrSetC.take(50)).toSeq)
      val attrSetE = AttributeSet((attrSetC.take(50) ++ attrSetA.take(50)).toSeq)
      val n_iter = 1000000
      val t0 = System.nanoTime()
      (1 to n_iter) foreach { _ =>
        val r1 = attrSetA -- attrSetB
        val r2 = attrSetA -- attrSetC
        val r3 = attrSetA -- attrSetD
        val r4 = attrSetA -- attrSetE
      }
      val t1 = System.nanoTime()
      val totalTime = t1 - t0
      // nanoTime() deltas are in nanoseconds, so this average is in ns despite the "us" label.
      println(s"Average time: ${totalTime / n_iter} us")
    }
    ```
    The results are:
    ```
    Before PR - Average time: 67674 us (100  %)
    After PR -  Average time: 28827 us (42.6 %)
    ```
    2) In `ColumnPruning`, it replaces the occurrences of `(attributeSet1 -- 
attributeSet2).nonEmpty` with `!attributeSet1.subsetOf(attributeSet2)`, which is 
orders of magnitude more efficient (especially when there are many 
attributes). Running the previous benchmark with `subsetOf` in place of `--` 
returns:
    ```
    Average time: 67 us (0.1 %)
    ```
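    
    The equivalence behind this replacement can be illustrated with plain Scala 
sets standing in for `AttributeSet` (an assumption made only for illustration; 
`SubsetVsDiff`, `viaDiff` and `viaSubset` are hypothetical names, not Spark 
code). `(x -- y).nonEmpty` holds exactly when `x` is not a subset of `y`, and 
the subset check does not need to materialize the difference:
    
    ```scala
    object SubsetVsDiff {
      // Builds the whole difference set first, then checks whether it is empty.
      def viaDiff(x: Set[Int], y: Set[Int]): Boolean = (x -- y).nonEmpty

      // Allocates no intermediate set; for standard Scala sets the check can
      // stop at the first element of x that is missing from y.
      def viaSubset(x: Set[Int], y: Set[Int]): Boolean = !x.subsetOf(y)
    }
    ```
    
    Both functions return the same answer for any pair of sets; only the amount 
of work differs.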
    
    3) It provides a more efficient way of building `AttributeSet`s, which can 
greatly improve the performance of the `references` and `outputSet` methods of 
`Expression` and `QueryPlan`. This basically avoids unneeded operations (e.g. 
creating many `AttributeEquals` wrapper objects that are not needed).
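    
    As a rough illustration of the kind of saving described in (3), the sketch 
below uses a simplified, hypothetical model (`Attr`, `AttrKey` and `AttrSet` 
are invented for this example and are not the Spark classes). It contrasts a 
union that unwraps and re-wraps every attribute with one that merges the 
already-wrapped keys:
    
    ```scala
    import scala.collection.mutable

    object AttrSetBuildSketch {
      // Hypothetical, simplified attribute: identity is determined by `id` only.
      final case class Attr(name: String, id: Long)

      // Wrapper providing id-based equality, standing in for the wrapper class.
      final class AttrKey(val a: Attr) {
        override def hashCode: Int = a.id.hashCode
        override def equals(o: Any): Boolean = o match {
          case k: AttrKey => k.a.id == a.id
          case _ => false
        }
      }

      final class AttrSet(val base: Set[AttrKey]) {
        def toSeq: Seq[Attr] = base.toSeq.map(_.a)
        def size: Int = base.size
      }

      // Unwraps every set to plain attributes, then allocates a new wrapper for each.
      def unionByRewrapping(sets: Iterable[AttrSet]): AttrSet =
        new AttrSet(sets.flatMap(_.toSeq).map(new AttrKey(_)).toSet)

      // Merges the existing wrapped keys directly; no new wrappers are allocated.
      def unionOfBases(sets: Iterable[AttrSet]): AttrSet = {
        val acc = mutable.LinkedHashSet.empty[AttrKey]
        sets.foreach(acc ++= _.base)
        new AttrSet(acc.toSet)
      }
    }
    ```
    
    Both functions produce the same set; the second simply skips the 
per-attribute wrapper allocation.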
    
    The overall effect of those optimizations has been tested on 
`ColumnPruning` with the following benchmark:
    
    ```
    test("ColumnPruning benchmark") {
      val attrSetA = (1 to 100).map { i => AttributeReference(s"c$i", IntegerType)() }
      val attrSetB = attrSetA.take(80)
      val attrSetC = attrSetA.take(20).map(a => Alias(Add(a, Literal(1)), s"${a.name}_1")())
    
      // query1 keeps a subset of the columns, query2 projects computed aliases,
      // query3 keeps every column.
      val input = LocalRelation(attrSetA)
      val query1 = Project(attrSetB, Project(attrSetA, input)).analyze
      val query2 = Project(attrSetC, Project(attrSetA, input)).analyze
      val query3 = Project(attrSetA, Project(attrSetA, input)).analyze
      val nIter = 100000
      val t0 = System.nanoTime()
      (1 to nIter).foreach { _ =>
        ColumnPruning(query1)
        ColumnPruning(query2)
        ColumnPruning(query3)
      }
      val t1 = System.nanoTime()
      val totalTime = t1 - t0
      // nanoTime() deltas are in nanoseconds, so this average is in ns despite the "us" label.
      println(s"Average time: ${totalTime / nIter} us")
    }
    ```
    
    The output of the test is:
    
    ```
    Before PR - Average time: 733471 us (100  %)
    After PR  - Average time: 362455 us (49.4 %)
    ```
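    
    The benchmarks above derive their averages from `System.nanoTime()`. A 
minimal sketch of a timing helper that makes the unit conversion explicit is 
shown below; `BenchTiming` and `avgMicros` are hypothetical names, not part of 
Spark or of this PR, and the helper does no JVM warm-up:
    
    ```scala
    object BenchTiming {
      // Hypothetical helper: runs `body` `iterations` times and returns the
      // mean wall-clock time per iteration, converted to microseconds.
      def avgMicros(iterations: Int)(body: => Unit): Double = {
        require(iterations > 0, "iterations must be positive")
        val t0 = System.nanoTime()
        var i = 0
        while (i < iterations) {
          body
          i += 1
        }
        val elapsedNanos = System.nanoTime() - t0
        // 1 microsecond = 1000 nanoseconds.
        elapsedNanos.toDouble / iterations / 1000.0
      }
    }
    ```
    
    It could be used as `BenchTiming.avgMicros(100000) { ColumnPruning(query1) }` 
inside a test like the one above.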
    
    The performance improvement was also evaluated on the 
`SQLQueryTestSuite` queries:
    
    ```
    (before) org.apache.spark.sql.catalyst.optimizer.ColumnPruning  518413198 / 1377707172  2756 / 15717
    (after)  org.apache.spark.sql.catalyst.optimizer.ColumnPruning  415432579 / 1121147950  2756 / 15717
    % Running time                                                  80.1% / 81.3%
    ```
    
    Other rules also benefit, especially from (3), although the impact is 
smaller, e.g.:
    ```
    (before) org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  307341442 / 623436806  2154 / 16480
    (after)  org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  290511312 / 560962495  2154 / 16480
    % Running time                                                              94.5% / 90.0%
    ```
    
    The impact on the `SQLQueryTestSuite` queries is smaller than in the 
benchmarks above because the optimizations matter more as the number of 
attributes involved grows. Since the test queries often involve very few 
attributes, the effect there is smaller.
    
    ## How was this patch tested?
    
    Ran the benchmarks above and the existing UTs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25379

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22364
    
----
commit 14edbe6a2fe8fab7131777302024b47ed19da513
Author: Marco Gaido <marcogaido91@...>
Date:   2018-09-07T18:30:49Z

    [SPARK-25379][SQL] Improve AttributeSet and ColumnPruning performance

----


