GitHub user bogdanrdc opened a pull request:

    https://github.com/apache/spark/pull/22201

    [SPARK-25209][SQL] Avoid deserializer check in Dataset.apply when Dataset is actually DataFrame

    ## What changes were proposed in this pull request?
    Dataset.apply calls dataset.deserializer (to provide an early error), which
    ends up running the full Analyzer on the deserializer. This can take tens of
    milliseconds, depending on the size of the plan.
    Because Dataset.apply is called by many Dataset operations, such as
    Dataset.where, this can be a significant overhead for short queries.
    According to a comment in the PR that introduced this check, the check can
    at least be removed for DataFrames:
    https://github.com/apache/spark/pull/20402#discussion_r164338267
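
To illustrate the idea described above, here is a minimal toy model in plain Scala. It is not the actual patch and does not use Spark classes: `Row`, `ToyDataset`, `resolveDeserializer`, and the runtime-class comparison are all hypothetical stand-ins, chosen only to show an eager, expensive check being skipped when the element type is the generic row type (the DataFrame analogue).

```scala
import scala.reflect.{ClassTag, classTag}

// Hypothetical stand-in for Spark's generic row type; not the real class.
final class Row(val values: Seq[Any])

// Hypothetical stand-in for a Dataset that remembers its element type.
final class ToyDataset[T](val tag: ClassTag[T]) {
  // Counts how many times the expensive check ran on this instance.
  var checks: Int = 0

  // Stand-in for resolving the deserializer, which the description says
  // ends up running the full Analyzer.
  def resolveDeserializer(): Unit = { checks += 1 }
}

object ToyDataset {
  def apply[T: ClassTag](): ToyDataset[T] = {
    val ds = new ToyDataset[T](classTag[T])
    // Assumption for this sketch: the untyped case is detected by comparing
    // the element's runtime class with Row. Only typed datasets pay for the
    // eager check.
    if (ds.tag.runtimeClass != classOf[Row]) ds.resolveDeserializer()
    ds
  }
}
```

In this model, `ToyDataset[Row]()` skips the check entirely, while `ToyDataset[String]()` still runs it once at construction time, so typed datasets keep the early error.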
    
    ## How was this patch tested?
    Existing tests + manual benchmark
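
The manual benchmark itself is not included in this message. A rough sketch of how such a measurement might look is given below, assuming a local Spark session is available on the classpath. The object name, dataset size, filter expression, and iteration counts are illustrative choices, not taken from the pull request.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical micro-benchmark: times how long it takes to build a new
// DataFrame with where(), without executing any query.
object WhereOverheadBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("where-overhead")
      .getOrCreate()

    val df = spark.range(100).toDF("id")

    // Warm up the JVM and Spark's internal caches before timing.
    (1 to 50).foreach(_ => df.where("id > 10"))

    val iterations = 200
    val start = System.nanoTime()
    (1 to iterations).foreach(_ => df.where("id > 10"))
    val perCallMs = (System.nanoTime() - start) / 1e6 / iterations

    println(f"average time to construct df.where(...): $perCallMs%.3f ms")
    spark.stop()
  }
}
```

No action is triggered here, so the timing covers only the construction of each new DataFrame. Running the same program against builds with and without the change would show the per-call difference.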

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bogdanrdc/spark deserializer-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22201.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22201
    
----
commit 7089e035253c80bd143f3af4d12f39643e9eaf84
Author: Bogdan Raducanu <bogdan@...>
Date:   2018-08-23T12:11:34Z

    optimization

----


