[ https://issues.apache.org/jira/browse/SPARK-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zongheng Yang updated SPARK-2393:
-
Description:
Spark SQL should support cost estimation, a common technique in query
optimization. For example, each logical operator should be able to estimate
its cardinality based on the estimates from its children.
As a first step, the framework to support doing so should be added, which might
include the interface for the aforementioned cardinality estimation, and/or
some other metrics.
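To make the idea concrete, here is a minimal sketch of what such an estimation interface might look like. It is written in Python for brevity and is not Spark's actual API: the class names (LogicalPlan, Relation, Filter, Join), the estimated_rows method, and the fixed selectivity value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class LogicalPlan:
    """Hypothetical base class: every operator can estimate its row count."""

    def estimated_rows(self) -> int:
        raise NotImplementedError


@dataclass
class Relation(LogicalPlan):
    """Leaf operator; its row count is assumed to be known up front."""
    name: str
    row_count: int

    def estimated_rows(self) -> int:
        return self.row_count


@dataclass
class Filter(LogicalPlan):
    """Scales the child's estimate by an assumed selectivity factor."""
    child: LogicalPlan
    selectivity: float = 0.5  # assumption: no real statistics are available

    def estimated_rows(self) -> int:
        return int(self.child.estimated_rows() * self.selectivity)


@dataclass
class Join(LogicalPlan):
    """Uses the cross-product size as a pessimistic upper bound."""
    left: LogicalPlan
    right: LogicalPlan

    def estimated_rows(self) -> int:
        return self.left.estimated_rows() * self.right.estimated_rows()


# Estimates propagate bottom-up from the leaves to the root of the plan.
plan = Join(Filter(Relation("orders", 1000)), Relation("countries", 20))
root_estimate = plan.estimated_rows()
```

Each operator only consults its direct children, so new operators can be added without touching existing ones. Other metrics, such as physical size in bytes, could be exposed the same way.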
As a first proof-of-concept use of this framework, a simple optimization
strategy for certain equi-joins should be added: if one side of a qualifying
join has a small estimated physical size (smaller than some threshold), the
planner should use a broadcast join physical plan to execute the join,
broadcasting the small side and streaming through the bigger side.
A similar concept exists in Hive and is explained
[here|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-OptimizeAutoJoinConversion].
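The selection rule described above could be sketched roughly as follows. This is an illustration in Python, not the planner code proposed for Spark: the names JoinSide and choose_join_strategy, the strategy labels, and the 10 MB default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed default; the issue only says "some threshold".
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class JoinSide:
    """Hypothetical stand-in for one input of a join and its size estimate."""
    name: str
    estimated_size_bytes: int


def choose_join_strategy(
    left: JoinSide,
    right: JoinSide,
    is_equi_join: bool,
    threshold: int = BROADCAST_THRESHOLD_BYTES,
) -> Tuple[str, Optional[str]]:
    """Return (strategy, side_to_broadcast).

    Only equi-joins qualify. If the smaller side's estimated size is below
    the threshold, that side is broadcast and the other side is streamed.
    """
    if not is_equi_join:
        return ("non_broadcast", None)
    smaller = left if left.estimated_size_bytes <= right.estimated_size_bytes else right
    if smaller.estimated_size_bytes < threshold:
        return ("broadcast", smaller.name)
    return ("non_broadcast", None)


# A small dimension table joined with a large fact table.
small = JoinSide("countries", 4 * 1024)
big = JoinSide("orders", 50 * 1024 * 1024 * 1024)
decision = choose_join_strategy(big, small, is_equi_join=True)
```

Because the decision depends only on the size estimates, its quality is bounded by the quality of the estimation framework: an underestimated side could be broadcast even though it is too large.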
Simple cost estimation and auto selection of broadcast join
---
Key: SPARK-2393
URL: https://issues.apache.org/jira/browse/SPARK-2393
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Critical