Repository: spark
Updated Branches:
  refs/heads/branch-1.0 baf92a0f2 -> 2ec7d7ab7
[SPARK-2443][SQL] Fix slow read from partitioned tables

This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The results below are updated for this fix.

## Benchmarks

Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

Author: Zongheng Yang <zonghen...@gmail.com>

Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:

d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.

(cherry picked from commit d60b09bb60cff106fa0acddebf35714503b20f03)
Signed-off-by: Michael Armbrust <mich...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ec7d7ab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ec7d7ab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ec7d7ab

Branch: refs/heads/branch-1.0
Commit: 2ec7d7ab751be67a86a048eed85bd9fd36dfaf83
Parents: baf92a0
Author: Zongheng Yang <zonghen...@gmail.com>
Authored: Mon Jul 14 13:22:24 2014 -0700
Committer: Michael Armbrust <mich...@databricks.com>
Committed: Mon Jul 14 13:22:39 2014 -0700

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/hive/TableReader.scala | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/2ec7d7ab/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
index 8cfde46..c394257 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
@@ -164,13 +164,17 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       hivePartitionRDD.mapPartitions { iter =>
         val hconf = broadcastedHiveConf.value.value
         val rowWithPartArr = new Array[Object](2)
+
+        // The update and deserializer initialization are intentionally
+        // kept out of the below iter.map loop to save performance.
+        rowWithPartArr.update(1, partValues)
+        val deserializer = localDeserializer.newInstance()
+        deserializer.initialize(hconf, partProps)
+
         // Map each tuple to a row object
         iter.map { value =>
-          val deserializer = localDeserializer.newInstance()
-          deserializer.initialize(hconf, partProps)
           val deserializedRow = deserializer.deserialize(value)
           rowWithPartArr.update(0, deserializedRow)
-          rowWithPartArr.update(1, partValues)
           rowWithPartArr.asInstanceOf[Object]
         }
       }
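For readers skimming the diff, the essence of the change is a classic hoisting optimization: per-partition setup (deserializer construction and initialization, plus the constant partition-values slot of the output array) is moved out of the per-record `iter.map` loop, so it runs once per partition instead of once per row. The sketch below is a minimal standalone illustration of that pattern, not Spark code; the `Deserializer` trait, `SlowDeserializer` class, and `HoistDemo` object are hypothetical stand-ins for Hive's deserializer API.

```scala
// Hypothetical stand-in for Hive's deserializer interface.
trait Deserializer {
  def initialize(props: Map[String, String]): Unit
  def deserialize(bytes: Array[Byte]): String
}

class SlowDeserializer extends Deserializer {
  private var prefix = ""
  def initialize(props: Map[String, String]): Unit = {
    // Imagine this step is expensive: reflection, schema parsing, etc.
    prefix = props.getOrElse("prefix", "")
  }
  def deserialize(bytes: Array[Byte]): String = prefix + new String(bytes)
}

object HoistDemo extends App {
  val props = Map("prefix" -> "row:")
  val records: Iterator[Array[Byte]] = Iterator.fill(5)("k=v".getBytes)

  // Before (the diff's "-" lines): a deserializer was built and initialized
  // once per record, inside iter.map.
  // After (the "+" lines): build and initialize once per partition, then
  // reuse the same instance for every record.
  val deserializer = new SlowDeserializer
  deserializer.initialize(props)
  val rows = records.map(deserializer.deserialize)
  rows.foreach(println)
}
```

With 10M rows in one partition, the old code paid the initialization cost 10M times, which is consistent with the partitioned-table runs dropping from roughly 27.6s to 1.23s once the setup happens only once.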