GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22630
[SPARK-25497][SQL] Limit operation within whole stage codegen should not consume all the inputs

## What changes were proposed in this pull request?

This PR is inspired by https://github.com/apache/spark/pull/22524, but picks a more aggressive fix.

The current limit whole stage codegen has 2 problems:
1. It's only applied to `InputAdapter`, so many leaf nodes can't stop early when a limit is reached.
2. It needs to override a method, which will break if we have more than one limit in the whole-stage.

The first problem is easy to fix: figure out which nodes can stop early w.r.t. the limit, and update them. This PR updates `RangeExec`, `ColumnarBatchScan`, `SortExec`, `HashAggregateExec` and `SortMergeJoinExec`.

The second problem is hard to fix. This PR proposes to propagate the limit counter variable name upstream, so that the upstream leaf/blocking nodes can check the limit counter and quit the loop early.

For better performance, the implementation here follows `CodegenSupport.needStopCheck`, so that we codegen the check only if there is a limit in the query. For columnar nodes like range, we check the limit counter per-batch instead of per-row, to keep the inner loop tight and fast.

## How was this patch tested?
A new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark limit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22630

----
commit d9b54d5c6edd4f5337efb2d185dbb58f33972616
Author: Wenchen Fan <wenchen@...>
Date:   2018-10-03T00:00:54Z

    Limit operation within whole stage codegen should not consume all the inputs

----
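
The limit-counter idea in the PR description can be illustrated with a small standalone sketch. This is not the code Spark generates and not the patch itself: the class `LimitSketch`, the fields `count` and `produced`, and both methods are hypothetical names chosen for illustration. It only shows the difference between an upstream loop that checks a shared limit counter per row and one that checks it once per batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Spark.
class LimitSketch {
    // Stand-in for the limit counter variable that the limit operator
    // would expose to upstream nodes.
    static int count;
    // How many rows the upstream loop actually produced.
    static long produced;

    // Row-at-a-time producer: the loop condition checks the counter before every row.
    static List<Long> perRow(long total, int limit) {
        count = 0;
        produced = 0;
        List<Long> out = new ArrayList<>();
        for (long i = 0; i < total && count < limit; i++) {
            produced++;
            // Downstream limit operator: pass the row on while under the limit.
            if (count < limit) {
                count++;
                out.add(i);
            }
        }
        return out;
    }

    // Batch producer: the counter is checked once per batch, so the inner
    // loop has no limit check of its own and may overshoot by up to one batch.
    static List<Long> perBatch(long total, int batchSize, int limit) {
        count = 0;
        produced = 0;
        List<Long> out = new ArrayList<>();
        for (long start = 0; start < total && count < limit; start += batchSize) {
            long end = Math.min(start + batchSize, total);
            for (long i = start; i < end; i++) {
                produced++;
                // Downstream limit operator still drops rows beyond the limit.
                if (count < limit) {
                    count++;
                    out.add(i);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> rows = perRow(1_000_000L, 10);
        System.out.println("per-row:   emitted=" + rows.size() + " produced=" + produced);
        rows = perBatch(1_000_000L, 1000, 10);
        System.out.println("per-batch: emitted=" + rows.size() + " produced=" + produced);
    }
}
```

In this sketch both variants emit exactly `limit` rows. The per-row loop stops producing as soon as the limit is hit, while the per-batch loop finishes the current batch first, trading a bounded amount of wasted work for a simpler inner loop.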