Github user viirya commented on the issue: https://github.com/apache/spark/pull/15596

@JoshRosen After a few tries, I think replacing `CollectLimitExec` with `GlobalLimitExec` is not a good idea. The main reason is whole-stage codegen. Since `GlobalLimitExec` supports whole-stage codegen, it will be wrapped in a `WholeStageCodegenExec`. So when we run `df.limit(1).collect()`, for example, we end up calling `executeCollect()` on the `WholeStageCodegenExec` that wraps `GlobalLimitExec`. `WholeStageCodegenExec.executeCollect()` is actually just `SparkPlan.executeCollect()`, so we will do the shuffle and then retrieve the results.

This doesn't harm anything, but it fails a few tests, as the Jenkins test results showed. Of course we could change the tests to fit the new behavior, but I don't think that is necessary or a good way to go. Another workaround is to override `WholeStageCodegenExec.executeCollect()`, but as @rxin pointed out in a previous comment, that is confusing.

Given all this, I think we had better keep `CollectLimitExec` and just remove its shuffling code, as I did in the initial commit. What do you think?
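
To make the dispatch problem concrete, here is a minimal toy sketch in plain Scala. It does not use Spark: the `Toy*` classes are hypothetical stand-ins for the operators named above, the returned strings only label which code path ran, and the limit-aware override on `ToyGlobalLimit` is an assumption made for illustration.

```scala
// Toy model only: these are NOT Spark's classes, just stand-ins for the
// operators discussed in this thread.
abstract class ToyPlan {
  // Default path, standing in for SparkPlan.executeCollect():
  // run the whole plan, including any shuffle.
  def executeCollect(): String = "generic collect (runs the shuffle)"
}

class ToyScan extends ToyPlan

// Stand-in for CollectLimitExec: has its own limit-aware collect path.
class ToyCollectLimit(val child: ToyPlan) extends ToyPlan {
  override def executeCollect(): String = "limit-aware collect"
}

// Stand-in for GlobalLimitExec, assuming (hypothetically) that it is
// given the same limit-aware override.
class ToyGlobalLimit(val child: ToyPlan) extends ToyPlan {
  override def executeCollect(): String = "limit-aware collect"
}

// Stand-in for WholeStageCodegenExec: wraps a child and does not
// override executeCollect(), so it inherits the generic path.
class ToyWholeStageCodegen(val child: ToyPlan) extends ToyPlan

object LimitDispatchDemo {
  // Root of the plan is the limit operator itself.
  val collectLimitPlan: ToyPlan = new ToyCollectLimit(new ToyScan)

  // Root of the plan is the codegen wrapper, which hides the limit operator.
  val wrappedGlobalLimitPlan: ToyPlan =
    new ToyWholeStageCodegen(new ToyGlobalLimit(new ToyScan))

  def main(args: Array[String]): Unit = {
    println(collectLimitPlan.executeCollect())
    println(wrappedGlobalLimitPlan.executeCollect())
  }
}
```

In this sketch the call on the wrapped plan resolves to the wrapper's inherited generic method, so the override on the inner limit operator is never reached. That is the situation described above for `WholeStageCodegenExec` wrapping `GlobalLimitExec`.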