[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472000#comment-15472000 ]

Ron Hu commented on SPARK-16026:
--------------------------------

Hi Srinath, thank you for your comments. Let me answer them one by one.
First, yes, we should consider the data shuffle cost; it is part of the phase 2
cost functions in our plan. Since we have already implemented the phase 1 cost
function, we want to contribute our existing development work to the Spark
community ASAP, and we will expand to the phase 2 CBO work soon. In phase 2, we
will develop a cost function for each execution operator, and the EXCHANGE
operator is one for which we need to define a cost function. Your suggestion is
quite reasonable.

Second, we define two statements: (1) ANALYZE TABLE table_name COMPUTE 
STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS 
column-name1, column-name2, …. As you know, the ANALYZE TABLE command collects 
auxiliary statistics information. A good DBA needs to monitor the freshness of 
the statistics: there is always the question of whether the statistics data is 
stale. Hence, we do not want to apply transactional consistency criteria to 
statistics data. On the other hand, we can do a little better to keep the 
table-level and column-level statistics consistent. One way is to refresh the 
table-level statistics whenever we execute "ANALYZE TABLE table_name COMPUTE 
STATISTICS FOR COLUMNS column-name1, column-name2, …" to collect column-level 
statistics.

Third, we do not assume a default selectivity. In the design spec, we defined 
how to estimate the cardinality for the logical AND operator in section 6. In 
the future, we may use 2-dimensional histograms and/or SQL hints to handle the 
correlation among multiple correlated columns. 
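For readers unfamiliar with the problem, the textbook baseline (absent histograms or hints) is the independence assumption: the selectivity of a conjunction is the product of the per-predicate selectivities. The sketch below illustrates that baseline only; the function names are illustrative and the actual formula Spark uses is the one defined in section 6 of the design spec.

```python
def and_selectivity(selectivities):
    """Combine per-predicate selectivities under the independence
    assumption: sel(p1 AND p2) = sel(p1) * sel(p2).

    This ignores correlation between columns, which is exactly the
    weakness that 2-D histograms or SQL hints would address."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def estimate_cardinality(row_count, selectivities):
    """Estimated number of output rows for a conjunctive filter."""
    return row_count * and_selectivity(selectivities)

# e.g. 1,000,000 rows with two predicates of selectivity 0.1 and 0.5
# gives roughly 1,000,000 * 0.1 * 0.5 = 50,000 estimated rows.
```

If the two columns are correlated (say, city and zip code), this product underestimates the true cardinality, which is the motivation for the correlation-aware techniques mentioned above.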


> Cost-based Optimizer framework
> ------------------------------
>
>                 Key: SPARK-16026
>                 URL: https://issues.apache.org/jira/browse/SPARK-16026
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>         Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
