[jira] [Issue Comment Deleted] (SPARK-16026) Cost-based Optimizer framework

2016-11-01 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-16026:
---
Comment: was deleted

(was: Design Specification of Spark Cost-Based Optimization is now posted.)

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-16026:
---
Comment: was deleted

(was: Hi Srinath, thank you for your comments. Let me answer them one by one.

First, on whether we should consider the data shuffle cost: yes, this is part
of the phase 2 cost functions in our plan. As we have already implemented the
phase 1 cost function, we want to contribute our existing development work to
the Spark community ASAP. We will expand to the phase 2 CBO work soon. In
phase 2, we will develop a cost function for each execution operator. The
EXCHANGE operator is one for which we need to define a cost function. Your
suggestion is quite reasonable.

Second, we define two statements: (1) ANALYZE TABLE table_name COMPUTE
STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS
column-name1, column-name2, …. As you know, the ANALYZE TABLE command collects
auxiliary statistics information. A good DBA needs to monitor the status of
the statistics information, because there is always the question of whether
the statistics data is stale. Hence, we do not want to apply transactional
criteria to statistics data. On the other hand, we can do a little better at
keeping the two levels consistent. One way is to refresh the table-level
statistics whenever we execute the command "ANALYZE TABLE table_name COMPUTE
STATISTICS FOR COLUMNS column-name1, column-name2, …" to collect column-level
statistics.
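
The refresh idea above can be illustrated with a minimal sketch. This is not
Spark code: TableStats, analyze_table and analyze_columns are hypothetical
names, and the "table" is just a list of dicts. It only shows that collecting
column-level statistics can also recompute the table-level row count in the
same pass, so the two levels do not drift apart.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableStats:
    """Hypothetical container for table-level and column-level statistics."""
    row_count: int = 0
    distinct_counts: Dict[str, int] = field(default_factory=dict)


def analyze_table(rows: List[dict], stats: TableStats) -> None:
    """Toy analogue of ANALYZE TABLE ... COMPUTE STATISTICS."""
    stats.row_count = len(rows)


def analyze_columns(rows: List[dict], columns: List[str], stats: TableStats) -> None:
    """Toy analogue of ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS ..."""
    # Refresh the table-level statistics first so both levels describe
    # the same snapshot of the data.
    analyze_table(rows, stats)
    for col in columns:
        stats.distinct_counts[col] = len({row[col] for row in rows})


rows = [{"id": 1, "city": "SF"}, {"id": 2, "city": "SF"}]
stats = TableStats()
analyze_table(rows, stats)

rows.append({"id": 3, "city": "NY"})    # the table changes; row_count is now stale
analyze_columns(rows, ["city"], stats)  # column pass also refreshes row_count
```

After the last call, the row count and the per-column distinct counts both
reflect the three-row table, even though only the column-level command was
run.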

Third, we do not assume a default selectivity. In the design spec, section 6
defines how to estimate the cardinality for the logical AND operator. In the
future, we may use 2-dimensional histograms and/or SQL hints to handle the
correlation among multiple correlated columns.)
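
As a rough illustration of cardinality estimation for a logical AND, the
sketch below multiplies the selectivities of the individual predicates. This
assumes the predicates are statistically independent, which is a common
textbook simplification; it is not taken from section 6 of the design spec,
which is not reproduced in this thread. The function names are hypothetical.

```python
from typing import Iterable


def and_selectivity(selectivities: Iterable[float]) -> float:
    """Combined selectivity of AND-ed predicates, assuming independence."""
    result = 1.0
    for s in selectivities:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"selectivity out of range: {s}")
        result *= s
    return result


def estimate_rows(row_count: int, selectivities: Iterable[float]) -> int:
    """Estimated output cardinality of a filter with AND-ed predicates."""
    return round(row_count * and_selectivity(selectivities))


# Two predicates that keep 50% and 25% of the rows respectively.
estimated = estimate_rows(1000, [0.5, 0.25])
```

When the filtered columns are correlated, the product is a poor estimate of
the true combined selectivity, which is the case the 2-dimensional histograms
or SQL hints mentioned above are meant to address.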







[jira] [Issue Comment Deleted] (SPARK-16026) Cost-based Optimizer framework

2016-08-16 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-16026:
---
Comment: was deleted

(was: I will post a pdf version of the design spec soon.)

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec_v_1_2.pdf






[jira] [Issue Comment Deleted] (SPARK-16026) Cost-based Optimizer framework

2016-08-16 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-16026:
---
Comment: was deleted

(was: Design Specification of Spark Cost-Based Optimization)



