Cheng Su created SPARK-32859:
--------------------------------
Summary: Introduce SQL physical plan rule to decide enable/disable
bucketing
Key: SPARK-32859
URL: https://issues.apache.org/jira/browse/SPARK-32859
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su
After an offline discussion with [~cloud_fan], we think it would be better to
enable or disable SQL bucketing automatically based on the query plan. Currently
bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true),
so every bucketed table in the query plan is read with a bucketed table scan
(all input files of a bucket are read by the same task). The drawback is that
when the query gets no benefit from the bucketed scan (no join, group-by, etc.
in the query), the scan still restricts the number of tasks to the number of
buckets, which might hurt parallelism.
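To make the parallelism concern concrete, here is a rough back-of-the-envelope
sketch. It assumes a non-bucketed scan yields roughly one task per fixed-size
split and ignores file open cost and bin-packing, so it is not Spark's actual
split planning. The function names, table size and split size are made up for
illustration.

```python
def bucketed_scan_tasks(num_buckets: int) -> int:
    # A bucketed scan reads all files of one bucket in one task.
    return num_buckets


def non_bucketed_scan_tasks(total_bytes: int, max_split_bytes: int) -> int:
    # Simplified assumption: one task per split of at most max_split_bytes.
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-total_bytes // max_split_bytes))


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical table: 64 GiB of data written into 8 buckets.
print(bucketed_scan_tasks(8))                         # 8
print(non_bucketed_scan_tasks(64 * GIB, 128 * MIB))   # 512
```

Under these assumptions the bucketed scan runs 8 tasks where a plain file scan
could run 512, which is only worth it if something downstream uses the
bucketing.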
The proposed change is to introduce a physical plan rule that runs right before
`EnsureRequirements`:
(1). transformUp() the physical plan, matching each SparkPlan operator that is
a FileSourceScanExec. If optionalBucketSet is set, enable bucket scan (there is
a bucket filter in this case).
(2). transformUp() the physical plan, matching each SparkPlan operator that is
a SparkPlanWithInterestingPartitioning.
SparkPlanWithInterestingPartitioning: the plan is one of {SortMergeJoinExec,
ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec,
SortAggregateExec, etc.}, i.e. it has
HashClusteredDistribution/ClusteredDistribution in its
requiredChildDistribution, and that
HashClusteredDistribution/ClusteredDistribution is on the bucketed columns of
its underlying FileSourceScanExec.
(3). For any child of SparkPlanWithInterestingPartitioning that does not
satisfy the plan's requiredChildDistribution, go through the child's sub-plan
tree and check whether:
(3.1). every node's outputPartitioning is the same as the child's, and every
node's requiredChildDistribution is UnspecifiedDistribution, and
(3.2). the leaf node is a FileSourceScanExec on a bucketed table, and
(3.3). with bucket scan enabled for this FileSourceScanExec, its
outputPartitioning satisfies the requiredChildDistribution of
SparkPlanWithInterestingPartitioning.
If (3.1), (3.2) and (3.3) are all true, enable bucket scan for this
FileSourceScanExec, and double-check that the new child of
SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
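The steps above can be sketched on a toy plan tree. This is only an
illustration of the proposed decision logic, not Spark code: `Node`, `walk`,
`bucketed_scan_below`, `satisfies` and `decide_bucket_scans` are hypothetical
names, a required distribution is reduced to a tuple of column names shared by
all children, and "satisfies" is simplified to a subset check on columns.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a SparkPlan operator (not Spark's API)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Columns the operator needs its children clustered on;
    # None models UnspecifiedDistribution.
    required_keys: Optional[Tuple[str, ...]] = None
    # False for operators that change partitioning (e.g. an exchange).
    preserves_partitioning: bool = True
    # Scan-only fields.
    bucket_cols: Optional[Tuple[str, ...]] = None
    has_bucket_filter: bool = False  # models optionalBucketSet being set
    use_bucket_scan: bool = False    # the decision this rule makes


def walk(node: Node) -> Iterator[Node]:
    """Yield every node of the plan tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def is_scan(node: Node) -> bool:
    return node.name == "FileSourceScanExec"


def bucketed_scan_below(node: Node) -> Optional[Node]:
    """Checks (3.1) and (3.2): follow a chain of pass-through operators
    and return the bucketed scan at its leaf, or None."""
    while not is_scan(node):
        passthrough = (
            node.required_keys is None
            and node.preserves_partitioning
            and len(node.children) == 1
        )
        if not passthrough:
            return None
        node = node.children[0]
    return node if node.bucket_cols else None


def satisfies(bucket_cols: Tuple[str, ...], required_keys: Tuple[str, ...]) -> bool:
    """Check (3.3), simplified: all bucket columns are among the required keys."""
    return set(bucket_cols) <= set(required_keys)


def decide_bucket_scans(plan: Node) -> Node:
    for node in walk(plan):
        # Step (1): a scan with a bucket filter always uses bucket scan.
        if is_scan(node) and node.bucket_cols and node.has_bucket_filter:
            node.use_bucket_scan = True
        # Steps (2) and (3): an operator with an interesting partitioning
        # enables bucket scan on qualifying scans underneath it.
        if node.required_keys is not None:
            for child in node.children:
                scan = bucketed_scan_below(child)
                if scan is not None and satisfies(scan.bucket_cols, node.required_keys):
                    scan.use_bucket_scan = True
    return plan


# A join on "id" over two tables bucketed on "id": both scans are enabled.
left = Node("FileSourceScanExec", bucket_cols=("id",))
right = Node("FileSourceScanExec", bucket_cols=("id",))
join = Node(
    "SortMergeJoinExec",
    [Node("ProjectExec", [left]), Node("FilterExec", [right])],
    required_keys=("id",),
)
decide_bucket_scans(join)

# A plain projection over a bucketed table: the scan stays non-bucketed.
lonely = Node("FileSourceScanExec", bucket_cols=("id",))
decide_bucket_scans(Node("ProjectExec", [lonely]))

print(left.use_bucket_scan, right.use_bucket_scan, lonely.use_bucket_scan)
# True True False
```

The sketch mutates the toy nodes in place for brevity, whereas a real rule
would return a transformed plan and re-check the parent's
requiredChildDistribution afterwards, as described above.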
The idea of SparkPlanWithInterestingPartitioning is inspired by the
"interesting order" in "Access Path Selection in a Relational Database
Management System" (http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf).