GitHub user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/16633
  
    Hi @viirya , the main concern of @scwf is that we can't afford a performance 
regression in any customer scenario. I think you can understand that :)
    
    I went through the discussion above, and it seems we already have solutions for 
both cases you mentioned 
[here](https://github.com/apache/spark/pull/16633#issuecomment-273963150), so 
the discussion comes down to the following two points:
    1. how to decide the threshold between the two cases;
    2. the fact that the RDD chain is broken.
    
    Let's wait for @rxin's comment on the second point. 
    
    Here I'm just interested in the first one.
    One possible way to get the number is to modify the map output statistics, as 
suggested by @scwf (a sketch of the idea follows).
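    To make the idea concrete, here is a minimal sketch assuming map output 
statistics were extended with a per-partition record count 
(`recordsByPartitionId` is hypothetical; today's `MapOutputStatistics` only 
tracks bytes per partition):

    ```scala
    // Hypothetical extension of map output statistics: suppose each shuffle
    // map stage also reported record counts per partition.
    case class MapOutputStats(bytesByPartitionId: Array[Long],
                              recordsByPartitionId: Array[Long])

    // With such statistics, the driver could decide how many partitions a
    // limit(n) actually needs to read, instead of launching tasks for all
    // of them.
    def partitionsNeededForLimit(stats: MapOutputStats, limit: Long): Int = {
      var seen = 0L
      var i = 0
      while (i < stats.recordsByPartitionId.length && seen < limit) {
        seen += stats.recordsByPartitionId(i)
        i += 1
      }
      i
    }
    ```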
    For CBO, if the computation before the limit is complex, it's hard to get 
an accurate estimate. E.g., for joins over filtered tables, the join keys and 
filter keys are probably different, so we'd need column correlation info (see 
the sketch below for why the usual independence assumption breaks down).
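    For reference, the textbook estimates under the independence assumption 
look roughly like this (illustrative numbers, not Spark's actual CBO code):

    ```scala
    // Filter: |sigma(R)| ~= |R| * selectivity
    def filterCard(rows: Long, selectivity: Double): Long =
      (rows * selectivity).toLong

    // Join: |R join S on k| ~= |R| * |S| / max(ndv(R.k), ndv(S.k))
    def joinCard(leftRows: Long, rightRows: Long,
                 leftNdv: Long, rightNdv: Long): Long =
      leftRows * rightRows / math.max(leftNdv, rightNdv)

    // If the filter key is correlated with the join key, the post-filter NDV
    // of the join key is unknown; simply reusing the base-table NDV (as done
    // here) can make the join estimate wildly wrong.
    val filtered = filterCard(1000000L, 0.01)  // 10,000 rows after the filter
    val estimate = joinCard(filtered, 500000L, 50000L, 50000L)
    ```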
    You mentioned we can get an estimated number together with a confidence; 
can you describe how?


