LuciferYang opened a new pull request, #55229: URL: https://github.com/apache/spark/pull/55229
## PR Description

### What changes were proposed in this pull request?

Enables bucket pruning and bucket join optimization for V2 file tables (`BatchScanExec` with `FileScan`), matching the V1 `FileSourceScanExec` behavior.

- Thread `BucketSpec` from `FileTable` through `FileScanBuilder` to `FileScan` and all 6 concrete scan classes (Parquet, ORC, CSV, JSON, Text, Avro)
- Implement bucketed file grouping in `FileScan.partitions`: files are grouped by bucket ID extracted from filenames, with optional bucket pruning and coalescing
- Report `HashPartitioning` from `DataSourceV2ScanExecBase.outputPartitioning` for bucketed scans, enabling shuffle-free joins
- Extend `DisableUnnecessaryBucketedScan` to handle `BatchScanExec`: disables bucketed scan when no downstream operator benefits from it
- Extend `CoalesceBucketsInJoin` to handle `BatchScanExec`: coalesces bucket counts for joins between tables with different bucket numbers
- Reuse `FileSourceStrategy.genBucketSet` (widened to `private[sql]`) for bucket pruning filter analysis

### Why are the changes needed?

V2 file tables (default after the gate removal in SPARK-56170) do not support bucket pruning or bucket join optimizations. Workloads that rely on bucketed tables for performance would regress when using V2 file tables.

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization that makes V2 file tables match V1 behavior for bucketed reads.

### How was this patch tested?

New `V2BucketedReadSuite` with 6 tests covering bucket pruning (equality + IN filters), bucketed join shuffle avoidance, disable unnecessary bucketed scan, bucket coalescing, and config-based bucketing disable. Existing `BucketedReadSuite` (31 tests), `DisableUnnecessaryBucketedScanSuite`, and `CoalesceBucketsInJoinSuite` all pass (50 tests total).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
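For readers unfamiliar with bucketed reads, the file grouping described under "What changes were proposed" can be illustrated with a minimal plain-Python sketch. It is not the PR's Scala code. The helper names `bucket_id_of` and `group_by_bucket` are hypothetical, and the filename pattern and modulo-based coalescing reflect an assumption about how Spark's V1 bucketed scan behaves rather than anything stated in this description.

```python
import re
from collections import defaultdict

# Assumption: the bucket ID is the run of digits after the last "_" in the
# file name, optionally followed by an extension, for example
# "part-00000-abc_00003.c000.snappy.parquet" -> bucket 3.
_BUCKETED_FILE_NAME = re.compile(r".*_(\d+)(?:\..*)?$")


def bucket_id_of(file_name):
    """Return the bucket ID encoded in a file name, or None if absent."""
    match = _BUCKETED_FILE_NAME.match(file_name)
    return int(match.group(1)) if match else None


def group_by_bucket(file_names, num_buckets,
                    pruned_buckets=None, num_coalesced_buckets=None):
    """Group files into one list per output partition.

    pruned_buckets: optional set of bucket IDs that survive filter analysis;
        files in any other bucket are skipped (bucket pruning).
    num_coalesced_buckets: optional smaller bucket count; bucket b is merged
        into partition b % num_coalesced_buckets (assumed coalescing rule).
    """
    groups = defaultdict(list)
    for name in file_names:
        bucket_id = bucket_id_of(name)
        if bucket_id is None:
            raise ValueError(f"not a bucketed file name: {name}")
        if pruned_buckets is not None and bucket_id not in pruned_buckets:
            continue  # pruned: no query predicate can match this bucket
        groups[bucket_id].append(name)

    if num_coalesced_buckets is not None:
        coalesced = defaultdict(list)
        for bucket_id, files in groups.items():
            coalesced[bucket_id % num_coalesced_buckets].extend(files)
        groups = coalesced
        num_partitions = num_coalesced_buckets
    else:
        num_partitions = num_buckets

    # One entry per partition, including empty ones, so the partition count
    # stays aligned with the bucket count that the scan reports.
    return [sorted(groups.get(i, [])) for i in range(num_partitions)]
```

In this sketch, pruned buckets still yield empty partitions rather than being dropped. The partition count then stays equal to the bucket count, which is what would let a bucketed scan report hash partitioning on the bucket columns.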
