GitHub user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9104#issuecomment-147872104

Micro-benchmark result with TPC-DS (scale-factor 15) `store_sales` table shows a ~12% performance gain.

Before:

- Round 0: 8133 ms
- Round 1: 7799 ms
- Round 2: 8010 ms
- Round 3: 8009 ms
- Round 4: 8223 ms
- Average: 8034.8 ms

After:

- Round 0: 7401 ms
- Round 1: 6897 ms
- Round 2: 6873 ms
- Round 3: 6935 ms
- Round 4: 7056 ms
- Average: 7032.4 ms

Benchmark code (where `ss_sold_date_sk` is an `INT` partitioning column and `ss_sold_time_sk` is an `INT` data column):

```scala
import com.google.common.base.Stopwatch

def benchmark(runs: Int, warmupRuns: Int = 0)(f: => Unit) {
  val stopwatch = new Stopwatch()

  (0 until warmupRuns).foreach { i => f }

  def run(i: Int) = {
    stopwatch.reset()
    stopwatch.start()
    f
    stopwatch.stop()

    val elapsed = stopwatch.elapsedMillis()
    println(s"Round $i: $elapsed ms")
    elapsed
  }

  val total = (0 until runs).map(i => run(i)).sum.toDouble
  println(s"Average: ${total / runs} ms")
}

val path = "file:///Users/lian/tpcds/sf15/store_sales"

benchmark(5, 5) {
  val df = sqlContext.read.parquet(path).selectExpr("ss_sold_time_sk", "ss_sold_date_sk")
  df.queryExecution.toRdd.foreach(row => ())
}
```
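
For reference, the ~12% figure can be reproduced from the round timings listed above. This is a minimal sketch of the arithmetic only; the names `before`, `after`, `mean`, and `gainPct` are illustrative and not part of the PR.

```scala
// Round timings in milliseconds, copied from the results above.
val before = Seq(8133, 7799, 8010, 8009, 8223)
val after  = Seq(7401, 6897, 6873, 6935, 7056)

// Arithmetic mean of the per-round timings.
def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

val beforeAvg = mean(before)
val afterAvg  = mean(after)

// Relative reduction in average wall-clock time, as a percentage.
val gainPct = (beforeAvg - afterAvg) / beforeAvg * 100

println(s"Before: $beforeAvg ms, after: $afterAvg ms, gain: $gainPct %")
```

The averages come out to 8034.8 ms and 7032.4 ms, a reduction of roughly 12.5% in average run time, consistent with the figure quoted above.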