Sean Zhong created SPARK-16907:
----------------------------------

             Summary: Parquet table reading performance regression when vectorized record reader is not used
                 Key: SPARK-16907
                 URL: https://issues.apache.org/jira/browse/SPARK-16907
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Sean Zhong


In the following Parquet reading benchmark, Spark 2.0 is 20%-30% slower than Spark 1.6. The table has a nested column, so the scan cannot use the vectorized Parquet record reader and falls back to the non-vectorized reader path.

{code}
// Test env: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Intel SSD SC2KW24
// Run in spark-shell; outside the shell you also need:
//   import spark.implicits._                       // for the $"..." syntax
//   import org.apache.spark.sql.functions.struct

// Generate a Parquet table with a nested column; the nested schema keeps
// the scan off the vectorized record reader path.
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

// Evaluate a by-name block once and return the elapsed time in milliseconds.
def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block    // forces evaluation of the by-name parameter
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1000000 + "ms")
  (t1 - t0) / 1000000
}

// Run the scan 20 times and take the average elapsed time in milliseconds.
val avgMs = (0 until 20).map { _ =>
  time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())
}.sum / 20
{code}
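To isolate the non-vectorized path, a flat-schema control run may be useful: a flat table is eligible for the vectorized reader, and disabling that reader via spark.sql.parquet.enableVectorizedReader should reproduce the slow path on the same data. The sketch below is illustrative and not part of the original benchmark; the path /tmp/data4_flat is hypothetical, and it reuses the time() helper defined above.

{code}
// Sketch of a flat-schema control case (hypothetical path /tmp/data4_flat).
// A top-level atomic column keeps the scan eligible for the vectorized reader.
spark.range(100000000).write.parquet("/tmp/data4_flat")

// Inspect the physical plan to see which reader path the scan uses.
spark.read.parquet("/tmp/data4_flat").filter($"id" < 100).explain()

// Average over 20 runs with the vectorized reader enabled (the default) ...
val flatAvgMs = (0 until 20).map { _ =>
  time(spark.read.parquet("/tmp/data4_flat").filter($"id" < 100).collect())
}.sum / 20

// ... then force the non-vectorized reader on the same flat table to check
// whether the slowdown follows the reader rather than the schema.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val flatNonVecAvgMs = (0 until 20).map { _ =>
  time(spark.read.parquet("/tmp/data4_flat").filter($"id" < 100).collect())
}.sum / 20
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
{code}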



