[
https://issues.apache.org/jira/browse/SPARK-57064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Norio Akagi updated SPARK-57064:
--------------------------------
Description:
### What changes were proposed in this pull request?
`DisableUnnecessaryBucketedScan` and `CoalesceBucketsInJoin` pattern-match on
the concrete class `FileSourceScanExec` in several read-only match sites where
only trait-level fields (`bucketedScan`, `relation`,
`optionalNumCoalescedBuckets`) are accessed. The `FileSourceScanLike` trait
already declares all of these fields, so the matches can safely be widened.
This PR changes 3 match sites from `FileSourceScanExec` to
`FileSourceScanLike`:
- `DisableUnnecessaryBucketedScan.apply` — the `hasBucketedScan` existence
check
- `ExtractJoinWithBuckets.hasScanOperation` — the bucket spec existence check
- `ExtractJoinWithBuckets.getBucketSpec` — the bucket spec extraction
Two match sites that call `.copy()` (a case-class-specific method) are
intentionally left on `FileSourceScanExec`.
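
The difference between the two kinds of match site can be illustrated with a minimal, self-contained sketch. It does not use the real Spark classes: `Plan`, `ScanLike`, `VanillaScan`, `PluginScan`, `Project`, and `BucketRules` are hypothetical stand-ins, with `ScanLike` playing the role of `FileSourceScanLike` and `VanillaScan` the role of `FileSourceScanExec`.

```scala
// Hypothetical stand-ins for illustration only; NOT the real Spark classes.
trait Plan { def children: Seq[Plan] }

// Plays the role of FileSourceScanLike: declares the fields the rules read.
trait ScanLike extends Plan {
  def bucketedScan: Boolean
  def children: Seq[Plan] = Nil
}

// Plays the role of FileSourceScanExec: a case class, so it has copy().
final case class VanillaScan(bucketedScan: Boolean) extends ScanLike

// Plays the role of a plugin scan operator: extends the trait, no copy().
final class PluginScan(val bucketedScan: Boolean) extends ScanLike

final case class Project(child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}

object BucketRules {
  // Read-only check matching on the concrete class: plugin scans are missed.
  def hasBucketedScanConcrete(plan: Plan): Boolean = plan match {
    case s: VanillaScan => s.bucketedScan
    case other          => other.children.exists(hasBucketedScanConcrete)
  }

  // The same check widened to the trait: any ScanLike is recognized.
  def hasBucketedScanTrait(plan: Plan): Boolean = plan match {
    case s: ScanLike => s.bucketedScan
    case other       => other.children.exists(hasBucketedScanTrait)
  }

  // A rewriting site: copy() exists only on the case class, so this match
  // has to stay on the concrete type.
  def disableBucketing(plan: Plan): Plan = plan match {
    case s: VanillaScan => s.copy(bucketedScan = false)
    case Project(child) => Project(disableBucketing(child))
    case other          => other
  }
}
```

In this sketch, `hasBucketedScanConcrete` returns `false` for a plan that contains only a `PluginScan`, while `hasBucketedScanTrait` finds it. `disableBucketing` leaves the `PluginScan` untouched, which mirrors why the `.copy()` sites are not widened.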
### Why are the changes needed?
Third-party columnar execution plugins (Gluten, Comet, RAPIDS) replace
`FileSourceScanExec` with their own scan operators that extend
`FileSourceScanLike`. With the current concrete-class matches, these plugins'
scan operators are invisible to the bucketing rules:
`DisableUnnecessaryBucketedScan` never finds them and `ExtractJoinWithBuckets`
never extracts their bucket specs.
This is the same class of issue addressed by SPARK-32332 and SPARK-32430 (AQE
hardcoding concrete classes instead of traits), but in the bucketing physical
rules, which were not covered by those fixes.
### Does this PR introduce _any_ user-facing change?
No. `FileSourceScanExec` already extends `FileSourceScanLike`, so behavior is
unchanged for vanilla Spark. Plugins that extend `FileSourceScanLike` will now
be recognized by the bucketing rules.
> Bucketing rules should match on FileSourceScanLike trait instead of
> FileSourceScanExec
> --------------------------------------------------------------------------------------
>
> Key: SPARK-57064
> URL: https://issues.apache.org/jira/browse/SPARK-57064
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 5.0.0
> Reporter: Norio Akagi
> Priority: Minor
>