[ https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194481#comment-17194481 ]
Cheng Su commented on SPARK-32859:
----------------------------------
I will raise a PR in the next couple of days.
> Introduce SQL physical plan rule to decide enable/disable bucketing
> --------------------------------------------------------------------
>
> Key: SPARK-32859
> URL: https://issues.apache.org/jira/browse/SPARK-32859
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Cheng Su
> Priority: Minor
>
> After an offline discussion with [~cloud_fan], we think it would be better to
> enable or disable SQL bucketing automatically based on the query plan. Currently
> bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true),
> so every bucketed table in the query plan is read with a bucket table scan
> (all input files of a bucket are read by the same task). The drawback is that
> when the bucket table scan brings no benefit (no join/group-by/etc. in the
> query), it still restricts the number of tasks to the number of buckets, which
> might hurt parallelism.
>
> The proposed change is to introduce a physical plan rule (right before
> `ensureRequirements`):
> (1). transformUp() the physical plan, matching each SparkPlan operator that is
> a FileSourceScanExec. If optionalBucketSet is set, enable bucket scan (a bucket
> filter applies in this case).
> (2). transformUp() the physical plan, matching each SparkPlan operator that is
> a SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in {SortMergeJoinExec,
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec,
> SortAggregateExec, etc., which have
> HashClusteredDistribution/ClusteredDistribution in
> requiredChildDistribution}, and its requiredChildDistribution is a
> HashClusteredDistribution/ClusteredDistribution on its underlying
> FileSourceScanExec's bucketed columns.
> (3). For any child of a SparkPlanWithInterestingPartitioning that does not
> satisfy the plan's requiredChildDistribution, go through the child's query
> plan subtree and check:
> (3.1). every node's outputPartitioning is the same as its child's, and every
> node's requiredChildDistribution is UnspecifiedDistribution;
> (3.2). the leaf node is a FileSourceScanExec on a bucketed table;
> (3.3). if bucket scan is enabled for this FileSourceScanExec, its
> outputPartitioning satisfies the requiredChildDistribution of the
> SparkPlanWithInterestingPartitioning.
> If (3.1), (3.2), and (3.3) are all true, enable bucket scan for this
> FileSourceScanExec, then double-check that the new child of the
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>
> The idea of SparkPlanWithInterestingPartitioning is inspired by
> "interesting order" in "Access Path Selection in a Relational Database
> Management System"
> ([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).
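
To make steps (1) through (3) of the description concrete, here is a rough, self-contained toy model in Python. It is only a sketch under stated assumptions and not Spark code: `Node`, `find_scan`, `decide_bucket_scan`, and every field name are hypothetical stand-ins for SparkPlan operators and their properties. It only follows single-child chains below the interesting operator, and it approximates "satisfies requiredChildDistribution" as "the bucket columns are a subset of the required clustering columns", ignoring the number of partitions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a SparkPlan operator (not a Spark class)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Columns each child must be clustered on; None models UnspecifiedDistribution.
    required_child_clustering: Optional[Tuple[str, ...]] = None
    # True when the node's output partitioning is the same as its child's.
    preserves_partitioning: bool = True
    # Set only on a scan of a bucketed table.
    bucket_columns: Optional[Tuple[str, ...]] = None
    # Models "optionalBucketSet is set" (a bucket filter applies).
    has_bucket_filter: bool = False
    # The decision produced by the rule.
    use_bucket_scan: bool = False


def walk(node: Node) -> Iterator[Node]:
    """Yield every node of the plan tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_scan(node: Node) -> Optional[Node]:
    """Conditions (3.1) and (3.2): follow a pass-through chain to a bucketed scan."""
    while node.children:
        if (len(node.children) != 1
                or not node.preserves_partitioning
                or node.required_child_clustering is not None):
            return None
        node = node.children[0]
    return node if node.bucket_columns else None


def decide_bucket_scan(plan: Node) -> Node:
    # Step (1): a scan with a bucket filter always uses bucket scan.
    for node in walk(plan):
        if node.bucket_columns and node.has_bucket_filter:
            node.use_bucket_scan = True
    # Steps (2) and (3): operators with an interesting partitioning.
    for node in walk(plan):
        if node.required_child_clustering is None:
            continue
        for child in node.children:
            scan = find_scan(child)
            # Condition (3.3), simplified to a subset check on column names.
            if scan and set(scan.bucket_columns) <= set(node.required_child_clustering):
                scan.use_bucket_scan = True
    return plan


def scan(name: str, cols: Tuple[str, ...], bucket_filter: bool = False) -> Node:
    return Node(name, bucket_columns=cols, has_bucket_filter=bucket_filter)


# A join on the bucket column: both scans should use bucket scan.
t1 = scan("scan_t1", ("id",))
t2 = scan("scan_t2", ("id",))
join = Node("SortMergeJoin", [Node("Filter", [t1]), t2],
            required_child_clustering=("id",))
decide_bucket_scan(join)

# A plain projection with no join or aggregate: bucket scan stays disabled.
t3 = scan("scan_t3", ("id",))
decide_bucket_scan(Node("Project", [t3]))
```

In this model the two scans under the join end up with `use_bucket_scan` set, while the scan under the plain projection does not, so it would be free to use as many tasks as a regular file scan.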