Repository: spark

Updated Branches:
  refs/heads/master b88ddb8a8 -> cd6dff78b
[SPARK-25209][SQL] Avoid deserializer check in Dataset.apply when Dataset is actually DataFrame

## What changes were proposed in this pull request?

Dataset.apply calls dataset.deserializer (to provide an early error), which ends up running the full Analyzer on the deserializer. This can take tens of milliseconds, depending on how big the plan is. Since Dataset.apply is called by many Dataset operations, such as Dataset.where, it can be a significant overhead for short queries. According to a comment in the PR that introduced this check, we can at least remove the check for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267

## How was this patch tested?

Existing tests + manual benchmark

Author: Bogdan Raducanu <bog...@databricks.com>

Closes #22201 from bogdanrdc/deserializer-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cd6dff78
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cd6dff78
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cd6dff78

Branch: refs/heads/master
Commit: cd6dff78be2739fab60487bc3145118208f46b9e
Parents: b88ddb8
Author: Bogdan Raducanu <bog...@databricks.com>
Authored: Fri Aug 24 04:13:07 2018 +0200
Committer: Herman van Hovell <hvanhov...@databricks.com>
Committed: Fri Aug 24 04:13:07 2018 +0200

----------------------------------------------------------------------
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/cd6dff78/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index f65948d..367b985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -65,7 +65,12 @@ private[sql] object Dataset {
     val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
     // Eagerly bind the encoder so we verify that the encoder matches the underlying
     // schema. The user will get an error if this is not the case.
-    dataset.deserializer
+    // optimization: it is guaranteed that [[InternalRow]] can be converted to [[Row]] so
+    // do not do this check in that case. this check can be expensive since it requires running
+    // the whole [[Analyzer]] to resolve the deserializer
+    if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
+      dataset.deserializer
+    }
     dataset
   }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
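Editor's note: the guard in the patch keys off the encoder's ClassTag, since a DataFrame is just a Dataset[Row]. A minimal standalone sketch of that idea in plain Scala (no Spark; `Row`, `Person`, and `needsDeserializerCheck` here are hypothetical stand-ins, not the Spark API):

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins (not the Spark API): Row plays the role of
// org.apache.spark.sql.Row; Person is a user-defined record type.
trait Row
final case class Person(name: String, age: Int)

// Mirrors the shape of the patch's guard: when the encoder's ClassTag is
// exactly Row (i.e. the Dataset is a DataFrame), the expensive eager
// deserializer resolution can be skipped; any other element type still
// gets the early check.
def needsDeserializerCheck[T](implicit tag: ClassTag[T]): Boolean =
  tag.runtimeClass != classOf[Row]

object Demo extends App {
  println(needsDeserializerCheck[Row])    // DataFrame case: skip the check
  println(needsDeserializerCheck[Person]) // typed Dataset: verify the encoder
}
```

The runtime-class comparison is cheap (a reference equality on Class objects), which is why it can replace a check that previously required running the whole Analyzer.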