GitHub user Dooyoung-Hwang opened a pull request: https://github.com/apache/spark/pull/22347
[SPARK-25353][SQL] Refactoring executeTake(n: Int) in SparkPlan

## What changes were proposed in this pull request?

In some cases, executeTake in SparkPlan can deserialize more rows than necessary. For example, when df.limit(1000).collect() is executed:

-> executeTake in SparkPlan is called with arg 1000.
-> If the total row count collected from the partitions is 2000, executeTake deserializes all of them and creates an array of InternalRow of size 2000.
-> The first 1000 rows are sliced off and returned; the 1000 rows at the rear were deserialized but never used.

Using a view of the Scala collection ensures that at most n rows are deserialized (see the sketch after the commit log below).

## How was this patch tested?

Existing unit tests that call the limit function of DataFrame:

    testOnly *SQLQuerySuite
    testOnly *DataFrameSuite

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Dooyoung-Hwang/spark refactor_execute_take

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22347.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22347

----

commit a8f14817ce3f52f710c3341148c2e1f3374335eb
Author: Dooyoung Hwang <dooyoung.hwang@...>
Date: 2018-09-06T07:49:17Z

    Refactoring executeTake(n: Int) in SparkPlan

    Using a view of Scala collections, deserialize at most n rows.

----
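A minimal, self-contained sketch of the idea behind the change (not Spark's actual executeTake; the `InternalRow` case class, the `deserialize` counter, and the byte arrays below are simplified stand-ins): mapping a deserializer over an eager collection decodes every row before `take(n)` runs, while mapping over a view defers the work so that forcing the first n elements decodes only n rows.

```scala
object ViewTakeSketch {
  // Simplified stand-in for Spark's InternalRow.
  final case class InternalRow(bytes: Array[Byte])

  // Counter so we can observe how many rows actually get decoded.
  var deserializedCount = 0
  def deserialize(bytes: Array[Byte]): InternalRow = {
    deserializedCount += 1
    InternalRow(bytes)
  }

  def main(args: Array[String]): Unit = {
    // Pretend 2000 serialized rows came back from the partitions.
    val collected: Seq[Array[Byte]] = Seq.fill(2000)(Array[Byte](1))

    // Eager path: map decodes all 2000 rows, then take(1000) discards half.
    deserializedCount = 0
    val eager = collected.map(deserialize).take(1000)
    println(s"eager: kept ${eager.size}, deserialized $deserializedCount")    // deserialized 2000

    // Lazy view: take(1000) bounds the map, so forcing with toArray
    // decodes only the first 1000 rows.
    deserializedCount = 0
    val viaView = collected.view.map(deserialize).take(1000).toArray
    println(s"view:  kept ${viaView.length}, deserialized $deserializedCount") // deserialized 1000
  }
}
```

This mirrors the df.limit(1000).collect() scenario above: with 2000 rows fetched, the eager path decodes 1000 rows it never returns, while the view path stops decoding once n rows have been produced.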