Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17138
  
    I'd start with my definition of a cost-based optimizer (CBO). Cost-based 
optimization is the process of choosing an optimal execution plan based on 
its estimated cost of execution. The estimated cost can be as crude as (1) a 
function of the estimated number of output rows from each operation in the 
plan, or as sophisticated as (2) a model of the execution cost of each operator.
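
    To make (1) concrete, here is a minimal Scala sketch of that crude 
cost model. The PlanNode shape and the names are hypothetical, made up for 
this illustration rather than taken from Spark's internals:

        // Hypothetical plan node carrying a per-operator row estimate.
        case class PlanNode(name: String, estimatedRows: Long,
                            children: Seq[PlanNode] = Nil)

        // Crude cost (1): sum the estimated output rows over all operators.
        def crudeCost(plan: PlanNode): Long =
          plan.estimatedRows + plan.children.map(crudeCost).sum

        // A CBO in this sense simply picks the cheapest candidate plan.
        def choosePlan(candidates: Seq[PlanNode]): PlanNode =
          candidates.minBy(crudeCost)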
    
    Cardinality estimation is a foundation of a CBO, but if we really want 
to model the execution cost, the cost has to be calculated at the point 
where we pick an execution method: which physical implementations are 
eligible for a given logical operation, for instance a hash-based versus a 
sort-based aggregation; a hash join, a sort-merge join, or a nested-loop 
join; in the case of a join, which table goes on the inner side and which 
on the outer; and whether we want to reshuffle the input or broadcast it. 
It can go even further: which physical properties of an operator's output, 
such as its ordering, are desirable because they may benefit the next 
operator.
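
    As a hedged sketch of one such decision point (the method names and 
the threshold below are assumptions for illustration, not Spark's actual 
planner logic):

        sealed trait JoinMethod
        case object BroadcastHashJoin extends JoinMethod
        case object ShuffledHashJoin  extends JoinMethod
        case object SortMergeJoin     extends JoinMethod

        // Pick a join method from estimated input sizes, in bytes. The
        // 10 MB default echoes the idea of an auto-broadcast threshold,
        // but the exact policy here is an assumption of this sketch.
        def pickJoin(leftBytes: Long, rightBytes: Long,
                     broadcastThreshold: Long = 10L * 1024 * 1024): JoinMethod =
          if (math.min(leftBytes, rightBytes) <= broadcastThreshold)
            BroadcastHashJoin
          else if (math.max(leftBytes, rightBytes) >= 8 * math.min(leftBytes, rightBytes))
            ShuffledHashJoin  // one side is much smaller: hash that side
          else
            SortMergeJoin     // comparable sizes: sort both sides and merge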
    
    An example is
    
     (T1 join T2 on T1.c1 = T2.c1 join T3 on T1.c2 = T3.c2) group by T1.c1
    
    Do we want to keep a plan for the join of {T1, T2, T3} that preserves 
the order on T1.c1, so that the subsequent aggregate comes virtually free? 
Or do we just pick the lowest-cost join plan, lose the order, and then do 
a hash aggregate?
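
    The trade-off can be put in numbers. These costs are made up purely 
for illustration:

        // Hypothetical operator costs, in arbitrary units.
        val sortMergeJoinCost = 120L // costlier join; output stays ordered on T1.c1
        val hashJoinCost      = 100L // cheaper join, but the order is lost
        val streamingAggCost  =   5L // aggregate over already-ordered input
        val hashAggCost       =  40L // must build a hash table of groups

        val orderPreservingPlan = sortMergeJoinCost + streamingAggCost // 125
        val orderLosingPlan     = hashJoinCost + hashAggCost           // 140
        // The hash join wins locally, yet the order-preserving plan wins
        // overall; only a cost model that sees past a single operator
        // catches this.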
    
    If the goal of Spark is to do (2) eventually, it is prudent to lay a 
good foundation by implementing any CBO functionality in the Physical 
Planning phase.
    
    On a slightly off-topic note, cardinality estimation is useful even in 
rule-based optimization during Logical Optimization. Rule-based 
optimization can do a better job if it has access to cardinality 
estimates, especially in some advanced optimization rules.
    
    An example is
    
    (small_tab T1 left join big_tab T2 on T1.c1 = T2.c1) group by T2.c1
    
    If we know that the values of T1.c1 from small_tab will filter out a 
lot of rows from big_tab, then a desired rewritten query would be
    
    <some compensation> ( ( small_tab T1 left join big_tab T2 ) group by T2.c1 )
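
    A hedged sketch of how a cardinality estimate could gate such a rule. 
The guard and its cutoff are assumptions for illustration, not an actual 
Spark rule:

        // Fire the rewrite only when the join is estimated to filter
        // big_tab aggressively, so the early group-by pays off.
        def shouldRewrite(estimatedJoinRows: Long, bigTabRows: Long,
                          selectivityCutoff: Double = 0.1): Boolean =
          bigTabRows > 0 && estimatedJoinRows.toDouble / bigTabRows < selectivityCutoff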


