GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19601
[SPARK-22383][SQL] Generate code to directly get value of primitive type array from ColumnVector for table cache

## What changes were proposed in this pull request?

This PR generates Java code that reads a value of a primitive type array directly from a `ColumnVector`, without going through an iterator, for table cache (e.g. `dataframe.cache`). It improves runtime performance by eliminating the data copy from column-oriented storage to `InternalRow` performed by the `SpecificColumnarIterator` for primitive types. This is a follow-up PR of #18747.

Benchmark result: **21.4x**

```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InternalRow codegen                             1368 / 1887         23.0          43.5       1.0X

Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ColumnVector codegen                              64 /   90        488.1           2.0       1.0X
```

Benchmark program

```scala
intArrayBenchmark(sqlContext, 1024 * 1024 * 20)

def intArrayBenchmark(sqlContext: SQLContext, values: Int, iters: Int = 20): Unit = {
  import sqlContext.implicits._
  val benchmarkPT = new Benchmark("Filter for int primitive array with cache", values, iters)
  val df = sqlContext.sparkContext.parallelize(0 to values, 1)
    .map(i => Array.range(i, values)).toDF("a").cache
  df.count // force df.cache to materialize

  val str = "ColumnVector"
  var c: Long = 0
  benchmarkPT.addCase(s"$str codegen") { iter =>
    c += df.filter(s"a[${values / 2}] % 10 = 0").count
  }
  benchmarkPT.run()
  df.unpersist(true)
  System.gc()
}
```

## How was this patch tested?
Added test cases to `ColumnVectorSuite`, `DataFrameTungstenSuite`, and `WholeStageCodegenSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-22383

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19601.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19601

----

commit 80b9e319211765807766e5cf70e995bdbbebf22e
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2017-10-29T03:28:06Z

    initial commit

----
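The performance idea behind this PR can be sketched outside of Spark: an iterator-based path copies (and boxes) an entire row before evaluating a predicate, while generated direct access reads only the primitive element the predicate needs. The sketch below is a minimal illustration in plain Java under that framing; `IntArrayColumnVector`, `getRowCopy`, `getInt`, `countViaCopy`, and `countDirect` are all hypothetical names, not Spark's actual `ColumnVector` API or its generated code.

```java
// Minimal stand-in for a columnar vector holding one int-array column.
// All class and method names here are hypothetical sketches, not Spark's API.
class IntArrayColumnVector {
    private final int[][] data;

    IntArrayColumnVector(int[][] data) { this.data = data; }

    int numRows() { return data.length; }

    // Iterator-style path: materialize the whole row as a boxed copy,
    // roughly analogous to copying column data into an InternalRow.
    Integer[] getRowCopy(int rowId) {
        int[] row = data[rowId];
        Integer[] copy = new Integer[row.length];
        for (int i = 0; i < row.length; i++) copy[i] = row[i]; // boxing + copy per element
        return copy;
    }

    // Direct path: read one primitive element, no row copy, no boxing.
    int getInt(int rowId, int ordinal) { return data[rowId][ordinal]; }
}

public class ColumnVectorSketch {
    // Evaluate "a[1] % 10 == 0" by copying each row first (slow path).
    static long countViaCopy(IntArrayColumnVector v) {
        long count = 0;
        for (int r = 0; r < v.numRows(); r++) {
            Integer[] row = v.getRowCopy(r);
            if (row[1] % 10 == 0) count++;
        }
        return count;
    }

    // Evaluate the same predicate reading only the needed element (fast path).
    static long countDirect(IntArrayColumnVector v) {
        long count = 0;
        for (int r = 0; r < v.numRows(); r++) {
            if (v.getInt(r, 1) % 10 == 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = { {0, 10, 20}, {1, 11, 21}, {2, 12, 22} };
        IntArrayColumnVector vector = new IntArrayColumnVector(rows);
        System.out.println(countViaCopy(vector) + " " + countDirect(vector)); // prints "1 1"
    }
}
```

Both paths return the same count; the benchmark in the PR description suggests that skipping the per-row copy is what accounts for the reported speedup on primitive-type arrays.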