Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/16633

Hi @viirya, the main concern of @scwf is that we can't afford a performance regression in any customer scenario. I think you can understand that :)

I went through the discussion above; it seems we already have a solution for both cases you mentioned [here](https://github.com/apache/spark/pull/16633#issuecomment-273963150), so the talking points become the following two:
1. how to decide the threshold between the two cases;
2. the RDD chain is broken.

Let's wait for @rxin's comment on the second point. Here I'm just interested in the first one. One possible way to get the number is to modify the map output statistics, as suggested by @scwf.

For CBO, if the computing logic before the limit is complex, it's hard to get an accurate estimate, e.g. joins from filtered tables, where the join keys and filter keys are probably different (that would need column correlation info). You mentioned we can get an estimated number and a confidence; can you describe how?
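To illustrate why confidence degrades in such plans, here is a minimal sketch (not Spark's actual CBO API; `Estimate`, the selectivity inputs, and the confidence decay factors are all hypothetical) of propagating a row-count estimate together with a confidence score through filter and join steps. Each step that relies on an unverified assumption, such as independence between filter keys and join keys, scales the confidence down:

```scala
// Hypothetical sketch: a cardinality estimate paired with a confidence score.
case class Estimate(rows: Double, confidence: Double)

object CardinalitySketch {
  // Filter: scale rows by selectivity. Confidence drops because the
  // selectivity itself is only an estimate derived from column statistics.
  def filter(in: Estimate, selectivity: Double): Estimate =
    Estimate(in.rows * selectivity, in.confidence * 0.9)

  // Equi-join assuming the join keys are independent of the earlier filters.
  // Without column correlation info, the independence assumption is the
  // weakest link, so it costs the most confidence.
  def join(left: Estimate, right: Estimate, keyCardinality: Double): Estimate =
    Estimate(left.rows * right.rows / keyCardinality,
             left.confidence * right.confidence * 0.7)
}
```

For example, filtering a 1000-row table at 10% selectivity and joining it with another filtered table leaves a plausible row count but a much lower confidence, which is exactly the situation described above for complex logic before a limit.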