GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19601
[SPARK-22383][SQL] Generate code to directly get value of primitive type array from ColumnVector for table cache

## What changes were proposed in this pull request?

This PR generates Java code that reads a value of a primitive type array directly from a `ColumnVector`, without going through an iterator, for table cache (e.g. `dataframe.cache`). It improves runtime performance by eliminating the data copy from column-oriented storage to `InternalRow` performed by the `SpecificColumnarIterator` for primitive types. This is a follow-up PR of #18747.

Benchmark result: **21.4x**

```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InternalRow codegen                             1368 / 1887         23.0          43.5       1.0X

Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ColumnVector codegen                              64 /   90        488.1           2.0       1.0X
```

Benchmark program

```scala
intArrayBenchmark(sqlContext, 1024 * 1024 * 20)

def intArrayBenchmark(sqlContext: SQLContext, values: Int, iters: Int = 20): Unit = {
  import sqlContext.implicits._
  val benchmarkPT = new Benchmark("Filter for int primitive array with cache", values, iters)
  val df = sqlContext.sparkContext.parallelize(0 to values, 1)
    .map(i => Array.range(i, values)).toDF("a").cache
  df.count // force df.cache to materialize

  val str = "ColumnVector"
  var c: Long = 0
  benchmarkPT.addCase(s"$str codegen") { iter =>
    c += df.filter(s"a[${values / 2}] % 10 = 0").count
  }
  benchmarkPT.run()
  df.unpersist(true)
  System.gc()
}
```

## How was this patch tested?
Added test cases to `ColumnVectorSuite`, `DataFrameTungstenSuite`, and `WholeStageCodegenSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-22383

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19601.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19601

----

commit 80b9e319211765807766e5cf70e995bdbbebf22e
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2017-10-29T03:28:06Z

    initial commit

----
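The performance idea behind this PR can be sketched outside of Spark: an iterator-based path copies (and boxes) an entire row before evaluating a predicate, while generated direct access reads only the primitive element the predicate needs. The sketch below is a minimal illustration in plain Java under that framing; `IntArrayColumnVector`, `getRowCopy`, `getInt`, `countViaCopy`, and `countDirect` are all hypothetical names, not Spark's actual `ColumnVector` API or its generated code.

```java
// Minimal stand-in for a columnar vector holding one int-array column.
// All class and method names here are hypothetical sketches, not Spark's API.
class IntArrayColumnVector {
    private final int[][] data;

    IntArrayColumnVector(int[][] data) { this.data = data; }

    int numRows() { return data.length; }

    // Iterator-style path: materialize the whole row as a boxed copy,
    // roughly analogous to copying column data into an InternalRow.
    Integer[] getRowCopy(int rowId) {
        int[] row = data[rowId];
        Integer[] copy = new Integer[row.length];
        for (int i = 0; i < row.length; i++) copy[i] = row[i]; // boxing + copy per element
        return copy;
    }

    // Direct path: read one primitive element, no row copy, no boxing.
    int getInt(int rowId, int ordinal) { return data[rowId][ordinal]; }
}

public class ColumnVectorSketch {
    // Evaluate "a[1] % 10 == 0" by copying each row first (slow path).
    static long countViaCopy(IntArrayColumnVector v) {
        long count = 0;
        for (int r = 0; r < v.numRows(); r++) {
            Integer[] row = v.getRowCopy(r);
            if (row[1] % 10 == 0) count++;
        }
        return count;
    }

    // Evaluate the same predicate reading only the needed element (fast path).
    static long countDirect(IntArrayColumnVector v) {
        long count = 0;
        for (int r = 0; r < v.numRows(); r++) {
            if (v.getInt(r, 1) % 10 == 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = { {0, 10, 20}, {1, 11, 21}, {2, 12, 22} };
        IntArrayColumnVector vector = new IntArrayColumnVector(rows);
        System.out.println(countViaCopy(vector) + " " + countDirect(vector)); // prints "1 1"
    }
}
```

Both paths return the same count; the benchmark in the PR description suggests that skipping the per-row copy is what accounts for the reported speedup on primitive-type arrays.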