[
https://issues.apache.org/jira/browse/SPARK-55928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-55928:
-----------------------------------
Labels: pull-request-available (was: )
> New linter for config effectiveness in views, UDFs and procedures
> -----------------------------------------------------------------
>
> Key: SPARK-55928
> URL: https://issues.apache.org/jira/browse/SPARK-55928
> Project: Spark
> Issue Type: Documentation
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Mihailo Timotic
> Priority: Major
> Labels: pull-request-available
>
> Summary
>
> Introduce a ConfigBindingPolicy framework in Apache Spark that requires all
> newly added configurations to explicitly declare how their values are bound
> when used within SQL views, UDFs, or procedures. This replaces the manually
> maintained, hardcoded RETAINED_ANALYSIS_FLAGS allowlist in the Analyzer with
> a dynamic, policy-driven approach.
>
>
> Background: Conf + views mechanics
>
> There are three ways a Spark config can interact with views:
>
> 1. The conf value is stored with a view/UDF/procedure on creation and is
> applied on read. Session value is deprioritized. Example: ANSI conf, timezone.
>
> 2. The conf is not stored with a view, but its value is propagated through a
> view from the active session. Example: kill-switches, feature flags.
>
> 3. The conf is neither stored with a view, nor propagated through a view.
> This is the historical default in Spark.
>
> The confusion arises for configurations that are not captured at
> view/UDF/procedure creation but still need to be used when querying them.
> The common assumption is that if a conf is not preserved at creation, its
> value inside the view/UDF/procedure will be whatever it is in the currently
> active session. This is NOT true.
>
> If a conf is not preserved on creation, its value when querying the
> view/UDF/procedure will be:
> - The value from the currently active session, only if the conf is in a
> hardcoded allowlist (RETAINED_ANALYSIS_FLAGS in Analyzer.scala).
> - The Spark default otherwise.
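>
> The effective-value rule above can be sketched as a simplified Python model
> (the function, conf names, and allowlist contents here are illustrative, not
> Spark's actual Analyzer code):

```python
# Simplified model of how a conf's effective value is resolved when a
# view/UDF/procedure is queried. All names here are illustrative.

RETAINED_ANALYSIS_FLAGS = {"spark.sql.someAllowlistedFlag"}  # hardcoded allowlist

def effective_value(conf, stored_confs, session_confs, defaults):
    # 1. A value captured at creation time always wins (e.g. ANSI mode).
    if conf in stored_confs:
        return stored_confs[conf]
    # 2. Otherwise the session value propagates ONLY for allowlisted confs.
    if conf in RETAINED_ANALYSIS_FLAGS:
        return session_confs.get(conf, defaults[conf])
    # 3. Everything else silently falls back to the Spark default,
    #    even if the session has set a different value.
    return defaults[conf]

defaults = {"spark.sql.someNewFlag": False, "spark.sql.someAllowlistedFlag": False}
session = {"spark.sql.someNewFlag": True, "spark.sql.someAllowlistedFlag": True}

# The non-allowlisted conf ignores the session value:
effective_value("spark.sql.someNewFlag", {}, session, defaults)          # -> False
effective_value("spark.sql.someAllowlistedFlag", {}, session, defaults)  # -> True
```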
>
> This allowlist is extremely non-obvious and easy to forget about. This has
> caused regressions in the past where new configs affecting query semantics
> were not added to the allowlist, causing views and UDFs to silently use Spark
> defaults instead of session values.
>
>
> Problem
>
> The Analyzer.RETAINED_ANALYSIS_FLAGS list is a manually maintained hardcoded
> allowlist of configs that should propagate from the active session when
> resolving views and SQL UDFs. This approach is error-prone: developers adding
> new configs that affect query semantics can easily forget to add them to this
> list, causing subtle bugs where views and UDFs silently use Spark defaults
> instead of session values. There is no automated enforcement to catch missing
> entries. Even within analysis, Spark can trigger a Spark job recursively
> which would potentially reference any config (for example, this is needed for
> schema inference), so the scope of affected configs is broader than it
> appears.
>
>
> Proposed Solution
>
> Introduce a ConfigBindingPolicy enum and require all newly added configs to
> explicitly declare a binding policy. This forces developers to think about
> how their config interacts with views, UDFs, and procedures at definition
> time.
>
> The enum has three values:
>
> - SESSION: The config value propagates from the active session to
> views/UDFs/procedures. This is the most common policy. Use it for feature
> flags or bugfix kill-switches where uniform behavior across the entire query
> is desired. Think about it this way: if you make a behavior change, roll it
> out enabled by default, and then discover a bug and need to revert it --
> unless the policy is SESSION, existing views will keep whatever behavior was
> baked in at creation, and the revert will not reach them.
> Examples: plan change logging (spark.sql.planChangeLog.level), bugfixes
> (spark.sql.analyzer.preferColumnOverLcaInArrayIndex).
>
> - PERSISTED: The config uses the value saved at view/UDF/procedure creation
> time, or the Spark default if none was saved. Use for configs that carry view
> semantic meaning that should be consistent regardless of session changes. A
> good example is ANSI mode -- views created with ANSI off should always have
> ANSI off, regardless of the session value.
>
> - NOT_APPLICABLE: The config does not interact with view/UDF/procedure
> resolution at all. Only choose this if you are confident the config doesn't
> interact with view/UDF/procedure analysis. If accessed at runtime, it behaves
> the same as SESSION. Examples: UI confs, server confs.
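>
> As a rough sketch, the enum and the declaration it forces could be modeled
> like this (a Python stand-in for the proposed Scala API; withBindingPolicy is
> named in this issue, but the builder shape shown here is an assumption):

```python
from enum import Enum, auto

class ConfigBindingPolicy(Enum):
    SESSION = auto()         # active session's value propagates into views/UDFs
    PERSISTED = auto()       # value captured at creation time is used on read
    NOT_APPLICABLE = auto()  # no interaction with view/UDF/procedure resolution

class ConfigBuilder:
    """Minimal stand-in for a config definition carrying a binding policy."""
    def __init__(self, key):
        self.key = key
        self.binding_policy = None  # new configs must declare one explicitly

    def with_binding_policy(self, policy):
        self.binding_policy = policy
        return self

# A new config now states up front how it binds inside views:
conf = (ConfigBuilder("spark.sql.myNewFeature.enabled")
        .with_binding_policy(ConfigBindingPolicy.SESSION))
```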
>
> The hardcoded RETAINED_ANALYSIS_FLAGS list is replaced with a dynamic lookup
> that retains all configs with SESSION or NOT_APPLICABLE binding policy when
> resolving views and SQL UDFs. Configs that were previously in the hardcoded
> list are annotated with withBindingPolicy(SESSION).
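>
> The dynamic lookup replacing the hardcoded list could look roughly like this
> (self-contained Python sketch; the registry shape, conf names, and function
> name are assumptions):

```python
from dataclasses import dataclass

# Policies as plain strings to keep the sketch self-contained.
SESSION, PERSISTED, NOT_APPLICABLE = "SESSION", "PERSISTED", "NOT_APPLICABLE"

@dataclass
class ConfEntry:
    key: str
    binding_policy: str

def retained_analysis_flags(all_confs):
    """Dynamic replacement for the hardcoded allowlist: retain every conf
    whose declared policy lets the session value flow into view/UDF analysis."""
    return {c.key for c in all_confs
            if c.binding_policy in (SESSION, NOT_APPLICABLE)}

registry = [
    ConfEntry("spark.sql.someFeatureFlag", SESSION),
    ConfEntry("spark.sql.ansi.enabled", PERSISTED),  # excluded: uses stored value
    ConfEntry("spark.ui.someServerConf", NOT_APPLICABLE),
]
retained = retained_analysis_flags(registry)
# -> {"spark.sql.someFeatureFlag", "spark.ui.someServerConf"}
```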
>
> A new enforcement test fails if any newly added config does not declare a
> bindingPolicy. Existing configs without a binding policy have been
> grandfathered into an exceptions allowlist. The long-term goal is to have all
> configs declare a binding policy and remove the exceptions allowlist entirely.
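>
> The enforcement check itself can be sketched as follows (illustrative; the
> exceptions list and conf names here are hypothetical):

```python
# Sketch of the enforcement test: every conf must declare a binding policy
# unless it was grandfathered in. All names here are hypothetical.

GRANDFATHERED = {"spark.sql.legacy.someOldConf"}  # existing confs, to burn down

def policy_violations(all_confs):
    """Return keys of confs that neither declare a policy nor are exempted."""
    return sorted(key for key, policy in all_confs.items()
                  if policy is None and key not in GRANDFATHERED)

confs = {
    "spark.sql.legacy.someOldConf": None,       # grandfathered: allowed
    "spark.sql.newFeature.enabled": "SESSION",  # declared: allowed
    "spark.sql.forgotten.flag": None,           # new conf, no policy: fails
}
violations = policy_violations(confs)  # -> ["spark.sql.forgotten.flag"]
```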
>
> Why are all confs affected by the linter? Even within analysis, Spark can
> trigger a Spark job recursively which would potentially reference any conf
> (for example, this is needed for schema inference). The linter is active for
> all newly added confs regardless of whether they directly interact with view
> analysis.
>
> Why not fix all existing confs? Currently there are over a thousand distinct
> configs in Spark. Fixing every single conf would introduce behavior changes.
> The linter only enforces the policy on new additions.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]