[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468891#comment-15468891
 ] 

Srinath commented on SPARK-16026:
---------------------------------

I have a couple of comments/questions on the proposal.
Regarding the join reordering algorithm: 
One of the big wins we should be able to get is avoiding shuffles/broadcasts. 
If the costing and dynamic programming algo doesn't take into account change 
costs and output partitioning we may produce some bad plans.
Here's an example: Suppose we start with completely unpartitioned tables A(a), 
B(b1, b2), C(c) and D(d), in increasing order of size and let's assume none of 
them are small enough to broadcast. Suppose we want to optimize the following 
join 
(A join B on A.a = B.b1) join C on (B.b2 = C.c) join D on (B.b1 = D.d).
Since A, B C and D are in increasing order of size and we try to minimize 
intermediate result size, we end up with the following “cheapest” plan (join 
order A-B-C-D):
{noformat}
Plan I
Join(B.b1 = D.d)
|-Exchange(b1)
|   Join(B.b2 = c)
|   |-Exchange(b2)
|   |   Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   | Exchange(b1)
|   |       B
|   | Exchange(c)
|       C
|-Exchange(d)
    D
{noformat}
Ignoring leaf node sizes, the cost according to the proposed model, i.e. the 
intermediate data size is Size(A join B) + size(ABC). This is also the size of 
intermediate data exchanged.
But a better plan may be to join to D before C (i.e. join order A-B-D-C) 
because that would avoid a re-shuffle 
{noformat}
Plan II
Join(B.b2 = C.c)
|-Exchange(B.b2)
|   Join (B.b1 = d)
|   |-Join(A.a = B.b1)
|   | |-Exchange(a)
|   | |   A
|   | | Exchange(b1)
|   |     B  
|   |-Exchange(d)
|       D
|-Exchange(c)
    C
{noformat}
The cost of this plan, i.e. the intermediate data size, is size(AB) + 
size(ABD), which is higher than Plan I. But the size of intermediate data 
exchanged is  size(ABD) which may be lower than size(AB) + size(ABC) of Plan I. 
This plan could be significantly faster as a result.

It should be relatively painless to incorporate partition-awareness into the 
dynamic programming proposal for cost-based join ordering — with a couple of 
tweaks
i) Take into account intermediate data exchanged, not just total intermediate 
data. For example, a good and simple start would be to use (exchanged-data, 
total-data) as the cost function, with a preference for the former (i.e. prefer 
lower exchanged data, and lower total-data if the exchanged data is the same). 
You could certainly have a more complex model, though. 
ii) Preserve (i.e. don't prune) partial plans based on output partitioning. 
e.g. consider a partial plan involving A, B and C. A join B join C may have a 
different output partitioning than A join C join B. If ACB is more expensive 
but has an output partitioning scheme that is useful for further joins, its 
worth preserving.

Another question I have is regarding statistics: With separate analyze 
column/analyze table statements it's possible for your statistics to have two 
different views of data, leading to weird results and inconsistent cardinality 
estimates.

For filter factor, what are the default selectivities assumed ? We may also 
want to cap the minimum selectivity, so that C1 && C2 && C3 etc. doesn’t lead 
to ridiculously low cardinality estimates.

> Cost-based Optimizer framework
> ------------------------------
>
>                 Key: SPARK-16026
>                 URL: https://issues.apache.org/jira/browse/SPARK-16026
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>         Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to