sunchao opened a new pull request, #56071:
URL: https://github.com/apache/spark/pull/56071

   ### What changes were proposed in this pull request?
   
   This PR extends dynamic partition pruning (DPP) eligibility for small, already
   materialized filtering sides:
   
   - `LocalRelation`, which represents locally available rows.
   - `LogicalRDD` produced by `checkpoint()` or `localCheckpoint()`.
   
   Checkpoint-created `LogicalRDD`s carry an explicit marker so that DPP is not
   enabled for arbitrary `LogicalRDD` inputs that may require recomputing an
   upstream query. This also keeps recursive CTE and `foreachBatch`-constructed
   inputs outside the new eligibility rule.
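
As a rough illustration of the eligibility rule described above, here is a toy model in plain Python. It is not Spark code: the two classes only mimic the plan node names, the `from_checkpoint` field is a hypothetical name for the explicit marker, and the "small" size check is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class LocalRelation:
    """Toy stand-in for a plan node holding locally available rows."""
    rows: list


@dataclass
class LogicalRDD:
    """Toy stand-in for a plan node wrapping an RDD."""
    source: str
    # Hypothetical name for the explicit marker set by checkpoint() or
    # localCheckpoint(); the real field name is not given in this description.
    from_checkpoint: bool = False


def is_materialized_filtering_side(plan) -> bool:
    """Return True if the toy plan counts as an already materialized side."""
    if isinstance(plan, LocalRelation):
        return True
    if isinstance(plan, LogicalRDD):
        # Only checkpoint-created nodes qualify; any other LogicalRDD might
        # require recomputing an upstream query.
        return plan.from_checkpoint
    return False
```

Under this sketch, a `LogicalRDD` built without the marker (for example from `foreachBatch` or a recursive CTE) stays ineligible by default.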
   
   This supersedes the unmerged approach in #53324 with narrower `LogicalRDD`
   handling while addressing SPARK-54593.
   
   ### Why are the changes needed?
   
   DPP currently requires a filtering predicate in the build-side logical plan.
   When a small filtering side is already materialized as a `LocalRelation` or a
   checkpointed `LogicalRDD`, that predicate is no longer present, so Spark misses
   partition pruning opportunities.
   
   This occurs for joins where a partition expression is matched to a small set of
   keys, for example `concat_ws("||", hour, category) = hc_key`. Although the
   expression is composed only from partition columns, the partitioned scan is not
   dynamically pruned when the filtering side is materialized.
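
To make the missed opportunity concrete, here is a minimal sketch in plain Python of what pruning on such an expression amounts to. It is a toy model, not Spark's implementation: the partition layout, the key set, and the `prune` helper are all assumptions made for illustration, and `concat_ws` is a simplified stand-in for the SQL function.

```python
def concat_ws(sep, *parts):
    """Toy stand-in for SQL concat_ws: join the non-null parts with sep."""
    return sep.join(str(p) for p in parts if p is not None)


# Hypothetical table layout: one entry per (hour, category) partition.
partitions = [(hour, category) for hour in range(3) for category in ("a", "b")]

# Small, already materialized filtering side: just a set of join keys.
hc_keys = {"0||a", "2||b"}


def prune(partitions, keys):
    """Keep only partitions whose derived key appears in the key set."""
    return [p for p in partitions if concat_ws("||", *p) in keys]


selected = prune(partitions, hc_keys)
```

Because the join expression depends only on partition columns, the key set alone decides which partitions need to be read; in this toy layout two of the six partitions survive.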
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Queries joining a partitioned file-source table with a small
   `LocalRelation` or checkpointed filtering side may now perform dynamic
   partition pruning and scan fewer partitions. There is no API change.
   
   ### How was this patch tested?
   
   - Added positive coverage for DPP using a `LocalRelation` build side with an
     expression over partition columns.
   - Added positive coverage for DPP using a `localCheckpoint()` build side with
     the same expression form.
   - Added negative coverage confirming that a non-checkpointed `LogicalRDD` does
     not become DPP-eligible.
   - Ran `build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff'`.
   - Ran `build/sbt sql/scalastyle sql/Test/scalastyle`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenAI Codex
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

