GitHub user Dooyoung-Hwang opened a pull request:

    https://github.com/apache/spark/pull/22347

    [SPARK-25353][SQL] Refactoring executeTake(n: Int) in SparkPlan

    ## What changes were proposed in this pull request?
    In some cases, executeTake in SparkPlan deserializes more rows than necessary.
    
    For example, suppose df.limit(1000).collect() is executed.
     -> executeTake in SparkPlan is called with the argument 1000.
     -> If the total row count fetched from the partitions is 2000, executeTake deserializes all of them and builds an array of InternalRow of size 2000.
     -> The first 1000 rows are then sliced off and returned; the remaining 1000 rows were deserialized but never used (see the sketch below).
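    
    As a rough sketch of the problem (not the actual SparkPlan code; decodeRow is a hypothetical stand-in for Spark's row deserialization), the eager pattern decodes every fetched row before slicing:
    
        // Minimal sketch of the eager pattern described above; decodeRow is a
        // hypothetical stand-in for the real row deserialization.
        def takeEager(encoded: Seq[Array[Byte]], n: Int): Seq[String] = {
          def decodeRow(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
          val allRows = encoded.map(decodeRow)  // decodes every row, e.g. all 2000
          allRows.take(n)                       // keeps only the first n; the rest was wasted work
        }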
    
    Using a view of the Scala collection ensures that at most n rows are deserialized (see the view-based sketch below).
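    
    A minimal sketch of the view-based approach, under the same assumption that decodeRow stands in for the real deserialization:
    
        // With a view, map is lazy, so decodeRow runs only for the rows that
        // take(n) actually pulls; toVector then materializes at most n rows.
        def takeLazy(encoded: Seq[Array[Byte]], n: Int): Seq[String] = {
          def decodeRow(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
          encoded.view.map(decodeRow).take(n).toVector
        }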
    
    ## How was this patch tested?
    Existing unit tests that call the limit function of DataFrame:
    
    testOnly *SQLQuerySuite
    testOnly *DataFrameSuite


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Dooyoung-Hwang/spark refactor_execute_take

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22347
    
----
commit a8f14817ce3f52f710c3341148c2e1f3374335eb
Author: Dooyoung Hwang <dooyoung.hwang@...>
Date:   2018-09-06T07:49:17Z

    Refactoring executeTake(n: Int) in SparkPlan
    
    Using a view of Scala collections, deserialize at most n rows.

----

