Repository: spark

Updated Branches:
  refs/heads/master b88ddb8a8 -> cd6dff78b
[SPARK-25209][SQL] Avoid deserializer check in Dataset.apply when Dataset is actually DataFrame

## What changes were proposed in this pull request?

Dataset.apply calls dataset.deserializer (to provide an early error), which ends up running the full Analyzer on the deserializer. This can take tens of milliseconds, depending on how big the plan is. Since Dataset.apply is called by many Dataset operations, such as Dataset.where, it can be a significant overhead for short queries. According to a comment in the PR that introduced this check, we can at least remove the check for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267

## How was this patch tested?

Existing tests + manual benchmark

Author: Bogdan Raducanu <bog...@databricks.com>

Closes #22201 from bogdanrdc/deserializer-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cd6dff78
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cd6dff78
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cd6dff78

Branch: refs/heads/master
Commit: cd6dff78be2739fab60487bc3145118208f46b9e
Parents: b88ddb8
Author: Bogdan Raducanu <bog...@databricks.com>
Authored: Fri Aug 24 04:13:07 2018 +0200
Committer: Herman van Hovell <hvanhov...@databricks.com>
Committed: Fri Aug 24 04:13:07 2018 +0200

----------------------------------------------------------------------
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/cd6dff78/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index f65948d..367b985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -65,7 +65,12 @@ private[sql] object Dataset {
     val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
     // Eagerly bind the encoder so we verify that the encoder matches the underlying
     // schema. The user will get an error if this is not the case.
-    dataset.deserializer
+    // optimization: it is guaranteed that [[InternalRow]] can be converted to [[Row]] so
+    // do not do this check in that case. this check can be expensive since it requires running
+    // the whole [[Analyzer]] to resolve the deserializer
+    if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
+      dataset.deserializer
+    }
     dataset
   }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
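Editor's note: the guard in the patch keys off the encoder's ClassTag, since a DataFrame is just a Dataset[Row]. A minimal standalone sketch of that idea in plain Scala (no Spark; `Row`, `Person`, and `needsDeserializerCheck` here are hypothetical stand-ins, not the Spark API):

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins (not the Spark API): Row plays the role of
// org.apache.spark.sql.Row; Person is a user-defined record type.
trait Row
final case class Person(name: String, age: Int)

// Mirrors the shape of the patch's guard: when the encoder's ClassTag is
// exactly Row (i.e. the Dataset is a DataFrame), the expensive eager
// deserializer resolution can be skipped; any other element type still
// gets the early check.
def needsDeserializerCheck[T](implicit tag: ClassTag[T]): Boolean =
  tag.runtimeClass != classOf[Row]

object Demo extends App {
  println(needsDeserializerCheck[Row])    // DataFrame case: skip the check
  println(needsDeserializerCheck[Person]) // typed Dataset: verify the encoder
}
```

The runtime-class comparison is cheap (a reference equality on Class objects), which is why it can replace a check that previously required running the whole Analyzer.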