[ 
https://issues.apache.org/jira/browse/SPARK-55928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55928:
-----------------------------------
    Labels: pull-request-available  (was: )

> New linter for config effectiveness in views, UDFs and procedures
> -----------------------------------------------------------------
>
>                 Key: SPARK-55928
>                 URL: https://issues.apache.org/jira/browse/SPARK-55928
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Mihailo Timotic
>            Priority: Major
>              Labels: pull-request-available
>
> Summary
>  
> Introduce a ConfigBindingPolicy framework in Apache Spark that requires 
> every newly added configuration to explicitly declare how its value is bound 
> when used within SQL views, UDFs, or procedures. This replaces the manually 
> maintained, hardcoded RETAINED_ANALYSIS_FLAGS allowlist in the Analyzer with 
> a dynamic, policy-driven approach.
>  
>  
> Background: Conf + views mechanics
>  
> There are three ways a Spark config can interact with views:
>  
> 1. The conf value is stored with a view/UDF/procedure on creation and is 
> applied on read, taking priority over the session value. Examples: the ANSI 
> conf, timezone.
>  
> 2. The conf is not stored with a view, but its value is propagated through a 
> view from the active session. Example: kill-switches, feature flags.
>  
> 3. The conf is neither stored with a view, nor propagated through a view. 
> This is the historical default in Spark.
>  
> The confusion arises for configurations that are not captured on 
> view/UDF/procedure creation, but still need to be used when querying them. 
> The common assumption is that if a conf is not preserved upon creation, its 
> value inside the view/UDF/procedure will be whatever the value is in the 
> currently active session. This is NOT true.
>  
> If a conf is not preserved on creation, its value when querying the 
> view/UDF/procedure will be:
> - The value from the currently active session, only if the conf is in a 
> hardcoded allowlist (RETAINED_ANALYSIS_FLAGS in Analyzer.scala).
> - The Spark default otherwise.
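>
> To make the surprise concrete, here is a minimal, self-contained Scala 
> sketch of the lookup described above (the helper name resolveViewConf is 
> hypothetical; only RETAINED_ANALYSIS_FLAGS mirrors the real Analyzer field, 
> and its contents here are illustrative):

```scala
// Hypothetical model of today's allowlist behavior; not Spark's actual code.
object AllowlistModel {
  // Mirrors the role of the hardcoded allowlist in Analyzer.scala
  // (contents here are illustrative).
  val RETAINED_ANALYSIS_FLAGS: Set[String] =
    Set("spark.sql.planChangeLog.level")

  // Value a non-persisted conf takes while resolving a view/UDF/procedure.
  def resolveViewConf(
      key: String,
      sessionConfs: Map[String, String],
      sparkDefaults: Map[String, String]): Option[String] = {
    if (RETAINED_ANALYSIS_FLAGS.contains(key)) {
      // Allowlisted: the active session's value is retained.
      sessionConfs.get(key).orElse(sparkDefaults.get(key))
    } else {
      // Not allowlisted: the session value is silently ignored.
      sparkDefaults.get(key)
    }
  }
}
```

> A session override of a conf outside the allowlist never reaches view 
> resolution; the Spark default wins silently.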
>  
> This allowlist is extremely non-obvious and easy to forget about. This has 
> caused regressions in the past where new configs affecting query semantics 
> were not added to the allowlist, causing views and UDFs to silently use Spark 
> defaults instead of session values.
>  
>  
> Problem
>  
> The Analyzer.RETAINED_ANALYSIS_FLAGS list is a manually maintained hardcoded 
> allowlist of configs that should propagate from the active session when 
> resolving views and SQL UDFs. This approach is error-prone: developers adding 
> new configs that affect query semantics can easily forget to add them to this 
> list, causing subtle bugs where views and UDFs silently use Spark defaults 
> instead of session values. There is no automated enforcement to catch missing 
entries. Even within analysis, Spark can recursively trigger a Spark job that 
may reference any config (schema inference, for example, requires this), so 
the scope of affected configs is broader than it appears.
>  
>  
> Proposed Solution
>  
> Introduce a ConfigBindingPolicy enum and require all newly added configs to 
> explicitly declare a binding policy. This forces developers to think about 
> how their config interacts with views, UDFs, and procedures at definition 
> time.
>  
> The enum has three values:
>  
> - SESSION: The config value propagates from the active session to 
> views/UDFs/procedures. This is the most common policy. Use it for feature 
> flags or bugfix kill-switches where uniform behavior across the entire query 
> is desired. Think of it this way: if you roll out a behavior change enabled 
> by default, then discover a bug and need to revert it, existing views will 
> still have the old behavior baked in unless the policy is SESSION. 
> Examples: plan change logging (spark.sql.planChangeLog.level), bugfixes 
> (spark.sql.analyzer.preferColumnOverLcaInArrayIndex).
>  
> - PERSISTED: The config uses the value saved at view/UDF/procedure creation 
> time, or the Spark default if none was saved. Use for configs that carry view 
> semantic meaning that should be consistent regardless of session changes. A 
> good example is ANSI mode -- views created with ANSI off should always have 
> ANSI off, regardless of the session value.
>  
> - NOT_APPLICABLE: The config does not interact with view/UDF/procedure 
> resolution at all. Only choose this if you are confident that is the case. 
> If the config is accessed at runtime anyway, it behaves the same as SESSION. 
> Examples: UI confs, server confs.
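>
> As a rough Scala sketch, the enum and its lookup rule could look like the 
> following (names and signatures are illustrative, not the actual Spark API):

```scala
// Hypothetical sketch of the proposed policy enum and its lookup rule.
sealed trait ConfigBindingPolicy
object ConfigBindingPolicy {
  case object Session extends ConfigBindingPolicy        // propagate from active session
  case object Persisted extends ConfigBindingPolicy      // use value captured at creation
  case object NotApplicable extends ConfigBindingPolicy  // no interaction with view analysis
}

object PolicyModel {
  import ConfigBindingPolicy._

  // Value a conf takes while resolving a view/UDF/procedure under each policy.
  def resolve(
      policy: ConfigBindingPolicy,
      sessionValue: Option[String],
      persistedValue: Option[String],
      sparkDefault: String): String = policy match {
    case Session   => sessionValue.getOrElse(sparkDefault)
    case Persisted => persistedValue.getOrElse(sparkDefault)
    // If read at runtime anyway, NOT_APPLICABLE behaves like SESSION.
    case NotApplicable => sessionValue.getOrElse(sparkDefault)
  }
}
```

> Note how PERSISTED falls back to the Spark default, not the session value, 
> when nothing was saved at creation time.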
>  
> The hardcoded RETAINED_ANALYSIS_FLAGS list is replaced with a dynamic lookup 
> that retains all configs with SESSION or NOT_APPLICABLE binding policy when 
> resolving views and SQL UDFs. Configs that were previously in the hardcoded 
> list are annotated with withBindingPolicy(SESSION).
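>
> A minimal sketch of that dynamic lookup, assuming a simple registry of conf 
> entries (ConfEntry and BindingPolicy here are stand-ins for Spark's actual 
> ConfigEntry machinery):

```scala
// Hypothetical registry-driven replacement for the hardcoded list.
sealed trait BindingPolicy
case object Session extends BindingPolicy
case object Persisted extends BindingPolicy
case object NotApplicable extends BindingPolicy

final case class ConfEntry(key: String, policy: Option[BindingPolicy])

object DynamicAllowlist {
  // Derives the retained set from declared policies instead of a hardcoded
  // list: SESSION and NOT_APPLICABLE confs propagate from the active session.
  def retainedAnalysisFlags(entries: Seq[ConfEntry]): Set[String] =
    entries.collect {
      case ConfEntry(key, Some(Session | NotApplicable)) => key
    }.toSet
}
```

> Adding a new conf with the right policy automatically makes it propagate; 
> there is no separate list to remember to update.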
>  
> A new enforcement test fails if any newly added config does not declare a 
> bindingPolicy. Existing configs without a binding policy have been 
> grandfathered into an exceptions allowlist. The long-term goal is to have all 
> configs declare a binding policy and remove the exceptions allowlist entirely.
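>
> The enforcement check itself can be thought of as an assertion over the conf 
> registry. A hypothetical sketch (ConfEntry and the helper names are 
> illustrative):

```scala
// Hypothetical shape of the enforcement check: every conf that is not in the
// grandfathered exceptions allowlist must declare a binding policy.
final case class ConfEntry(key: String, hasBindingPolicy: Boolean)

object BindingPolicyLinter {
  // Returns the keys of confs that fail the check; a test would assert
  // that this sequence is empty.
  def violations(entries: Seq[ConfEntry], grandfathered: Set[String]): Seq[String] =
    entries.collect {
      case ConfEntry(key, false) if !grandfathered.contains(key) => key
    }
}
```

> Shrinking the grandfathered set over time then drives the codebase toward 
> full policy coverage.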
>  
> Why are all confs affected by the linter? Even within analysis, Spark can 
> recursively trigger a Spark job that may reference any conf (schema 
> inference, for example, requires this). The linter is therefore active for 
> all newly added confs, regardless of whether they directly interact with 
> view analysis.
>  
> Why not fix all existing confs? There are currently over a thousand distinct 
> configs in Spark, and retroactively assigning a policy to every one of them 
> would risk behavior changes. The linter only enforces the policy on new 
> additions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
