[ 
https://issues.apache.org/jira/browse/SPARK-57194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57194:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add earlyOperatorOptimizationRules extension point to Optimizer
> ---------------------------------------------------------------
>
>                 Key: SPARK-57194
>                 URL: https://issues.apache.org/jira/browse/SPARK-57194
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Emily Sun
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Problem
> Custom optimizer rules injected via 
> *SparkSessionExtensions.injectOptimizerRule*
> run inside the fixed-point {*}Operator Optimization batch{*}, alongside 
> built-in
> rewriters like {_}FoldablePropagation, ConstantFolding, and 
> PushDownPredicates{_}{*}.{*}
> A custom rule that needs to observe the original plan shape (e.g. cross-side 
> join
> predicates before they are folded into single-side constants) can be silently
> defeated when a built-in rule transforms the plan first within the same fixed 
> point.
> Existing extension points don't cover this:
>  * *extendedOperatorOptimizationRules:* runs inside the same fixed-point batch
>  * *extendedResolutionRules / postHocResolutionRules:* analyzer phase, too 
> early
>  * *earlyScanPushDownRules:* runs after optimization, scoped to scan pushdown
> h2. Proposed change
> Add *earlyOperatorOptimizationRules* on {*}Optimizer{*}, executed in a *Once*
> batch named "Early Operator Optimization" placed between *Replace Operators*
> and *Aggregate,* before the fixed-point Operator Optimization batch.
> {code:java}
> Batch("Replace Operators", fixedPoint,
>   RewriteExceptAll,
>   RewriteIntersectAll,
>   ReplaceIntersectWithSemiJoin,
>   ReplaceExceptWithFilter,
>   ReplaceExceptWithAntiJoin,
>   ReplaceDistinctWithAggregate,
>   ReplaceDeduplicateWithAggregate) ::
> Batch("Early Operator Optimization", Once,
>   earlyOperatorOptimizationRules: _*) ::
> Batch("Aggregate", fixedPoint,
>   RemoveLiteralFromGroupExpressions,
>   RemoveRepetitionFromGroupExpressions) :: Nil ++ {code}
> Wired through *SparkSessionExtensions.injectEarlyOptimizerRule* and
> {*}BaseSessionStateBuilder{*}. The batch is a no-op when no rule is 
> registered.
> h2. Use cases
>  * Outer-join semantics rules that must inspect cross-side join predicates 
> before *FoldablePropagation* rewrites them into single-side foldable 
> expressions(the concrete motivation for this proposal).
>  * Plan-shape analyses / tagging rules that must run on a canonical 
> pre-optimization plan.
>  * Any custom rule that conceptually belongs to a single Once-batch pass 
> before the main optimizer iterates to fixed point.
> h2. Benefits
>  * *Fills a extensibility gap.* None of the existing injection points can 
> host a rule that must observe the pre-fixed-point plan, this extension point 
> helps fill the gap for such use cases. 
>  * *General, not use-case-specific.* The hook is a neutral ordering primitive 
> - any rule needing pre-optimization plan shape benefits.
>  * *Low risk to adopt.* Purely additive, default to {{{}Nil{}}}, no change to 
> existing batch ordering or APIs, and a no-op when unused.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to