[
https://issues.apache.org/jira/browse/SPARK-56681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Tenedorio reassigned SPARK-56681:
----------------------------------------
Assignee: Serge Rielau
> PATH cleanup
> ------------
>
> Key: SPARK-56681
> URL: https://issues.apache.org/jira/browse/SPARK-56681
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.2.0
> Reporter: Serge Rielau
> Assignee: Serge Rielau
> Priority: Major
> Labels: pull-request-available
>
> Umbrella for SQL Language follow-ups identified Each item below is
> independently actionable; #1, #4, #5, #6, #7, #11 are 1-line / 1-block
> fixes; #2 and #3 are the substantive follow-up work the SPARK-56605
> docstring already promised.
> Source review:
> [https://github.com/databricks-eng/runtime/pull/215112#issuecomment-4353729745]
> (See the comment on this issue for the full enumerated checklist; each
> item can be broken out into its own linked subtask.)
> h2. Critical
> h3. 1. {{FunctionResolution.resolveProcedure}} is dead code
> File:
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionResolution.scala}}
> The PR adds a ~70-line {{def resolveProcedure(unresolved:
> UnresolvedProcedure): LogicalPlan}} that has no callers in
> {{{}apache/spark{}}}. Procedure resolution is still routed through the
> existing {{Analyzer.ResolveProcedures}} rule into a different code path.
> Downstream forks have to either wire it or delete it on each cherry-pick.
> *Action:* either replace the existing call site to use the new method (and
> delete the older path) or revert the hunk. Decide as part of the wiring PR
> that the SPARK-56605 docstring promises.
> h2. High
> h3. 2. Frozen view / SQL-function PATH wiring is unfinished
> The PR docstring states:
> Frozen path analysis for views/SQL functions (using stored path during
> analysis) comes in a follow-up PR. This PR only wires the resolution engine
> to use the live session path.\{quote}
>
> The follow-up has not landed. Concretely:
> * {{AnalysisContext.resolutionPathEntries}} is declared but never populated
> from view metadata.
> * {{CatalogManager.serializePathEntries}} / {{pathEntriesForPersistence}}
> are declared but never written to view / SQL function properties.
> * {{AnalysisContext.relationResolutionPathCache}} is unused on this branch.
> *Action:* finish the wiring. Capture path entries at CREATE VIEW / CREATE
> FUNCTION time, persist into catalog properties, deserialize on view body /
> SQL function body re-analysis, and populate
> {{AnalysisContext.resolutionPathEntries}} accordingly.
> h3. 3. {{AnalysisContext.resolutionPathEntries}} is a new threadlocal
> The single-pass resolver project explicitly forbids new threadlocal state.
> The frozen view path needs to be plumbed through
> {{OperatorResolutionContextStack}} (or another single-pass-friendly
> propagation mechanism) and the field removed from {{{}AnalysisContext{}}}.
> *Action:* before/together with #2, plumb the frozen path through the
> single-pass operator stack instead of {{{}AnalysisContext{}}}. Audit
> {{AnalysisContext.withNewAnalysisContext}} and {{AnalysisContext.reset}} to
> ensure no leakage even before the field is removed.
> h3. 4. {{Analyzer.executeAndCheck}} clobbers outer
> {{SQLConf.withExistingConf}}
> After the PR, {{Analyzer}} carries a {{sessionConf: Option[SQLConf]}} and
> {{executeAndCheck}} unconditionally wraps analysis in
> {{{}SQLConf.withExistingConf(sessionConf){}}}. This clobbers any outer
> {{withExistingConf}} scope (e.g. a SQL UDF / view body that wired frozen
> configs). {{executeSameContext}} already short-circuits with {{SQLConf.get ne
> c => super.execute(plan)}} but {{executeAndCheck}} does not. The two
> predicates also disagree: {{executeSameContext}} uses {{SQLConf.get}} (falls
> through to {{{}confGetter{}}}) while a correct fix needs to peek at the
> threadlocal directly so "no outer scope" and "outer scope happens to match
> the session conf" are not conflated.
> *Action:* add public {{SQLConf.getExistingConfIfSet: Option[SQLConf]}}
> (returns {{Some}} only when an outer scope explicitly installed a conf via
> {{{}withExistingConf{}}}). Extract one {{runWithSessionConf}} helper in
> {{{}Analyzer{}}}, used by both {{executeAndCheck}} and
> {{{}executeSameContext{}}}, that:
> * returns the thunk unchanged if {{sessionConf}} is {{{}None{}}};
> * returns the thunk unchanged if an outer scope has already installed a
> different conf via {{withExistingConf}} (use {{{}getExistingConfIfSet{}}});
> * otherwise wraps in {{{}SQLConf.withExistingConf(sessionConf){}}}.
> Reproducer: a SQL UDF / view body that captures an {{ANSI_ENABLED}} value at
> CREATE time and is re-analyzed in a session that has flipped {{ANSI_ENABLED}}
> should still see the captured value during analysis. Today the captured value
> is overwritten by the analyzer's {{{}sessionConf{}}}.
> h3. 5. {{VariableResolution.allowUnqualifiedSessionTempVariableLookup}}
> force-loads default catalog
> File:
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/VariableResolution.scala}}
> The check eagerly evaluates {{{}catalogManager.currentCatalog.name(){}}},
> which calls {{Catalogs.load}} on the configured default catalog. If the
> configured default catalog is not a registered plugin (e.g. tests, customer
> sessions with a mistyped {{{}spark.sql.defaultCatalog{}}}), this throws
> {{CATALOG_NOT_FOUND}} mid-analysis on every column-resolution pass that
> reaches the variable-lookup last-resort path.
> When SQL PATH is not explicitly set, the answer is fully determined by
> {{SQLConf.isDefinerProcedureExcludeSessionFromSearchPath}} (the default path
> always contains {{system.session}} otherwise) and never depends on the
> catalog name.
> *Action:* short-circuit before touching the catalog when PATH is not
> explicitly set:
> {code:scala}
> private def allowUnqualifiedSessionTempVariableLookup(nameParts:
> Seq[String]): Boolean = {
> if (nameParts.length != 1) return true
> val pathExplicitlySet = conf.pathEnabled &&
> catalogManager.sessionPathEntries.isDefined
> if (!pathExplicitlySet) {
> return !conf.isDefinerProcedureExcludeSessionFromSearchPath
> }
> catalogManager.sessionScopeUnqualifiedAllowed(
> catalogManager.currentCatalog.name(),
> catalogManager.currentNamespace.toSeq)
> }
> {code}
> h3. 6. DROP VARIABLE PATH gate is asymmetric with DECLARE / CREATE VARIABLE
> File:
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala}}
> The PR added a {{{}system.session{}}}-on-PATH gate to {{DropVariable}} only.
> {{CreateVariable}} (DECLARE / CREATE) has no such gate, so a user can
> {{DECLARE OR REPLACE VARIABLE foo}} they cannot drop under a {{SET PATH}}
> that omits {{{}system.session{}}}. {{SET VAR foo = ...}} (which goes through
> {{VariableResolution.lookupVariable}} and inherits the same gate) also fails
> in that case.
> The intended policy is "DDL ignores PATH, DML uses it":
> * DECLARE / CREATE / DROP VARIABLE always target {{system.session}} directly
> regardless of PATH (no gate).
> * SET VAR / {{SELECT @foo}} resolve through PATH and require
> {{system.session}} on the path (gate stays).
> *Action:* remove the {{if (nameParts.length == 1 &&
> !catalogManager.sessionScopeUnqualifiedAllowed(...)) throw ...}} block from
> the {{DropVariable}} case in {{{}ResolveCatalogs{}}}. Add a regression test
> that exercises a {{DECLARE -> SET VAR (qualified) -> DROP}} cycle under
> {{{}SET PATH = somecat.somesch{}}}.
> h3. 7. {{lookupFunctionType}} exception swallow is too broad
> File:
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionResolution.scala}}
> The PR uses {{{}catch \{ case NonFatal(_) => false }}} around
> {{functionExists{}}}. That swallows {{{}PERMISSION_DENIED{}}}, transient
> catalog errors, and unrelated bugs uniformly – the user sees
> {{UNRESOLVED_ROUTINE}} with no upstream cause, and the underlying error never
> surfaces.
> *Action:* mirror the explicit not-found list already used in
> {{{}resolveFunctionCandidate{}}}:
> {code:scala}
> catch {
> case _: NoSuchFunctionException
> | _: NoSuchDatabaseException
> | _: NoSuchNamespaceException
> | _: CatalogNotFoundException =>
> false
> case e: AnalysisException if e.getCondition == "FORBIDDEN_OPERATION" =>
> false
> }
> {code}
> Other exceptions should propagate.
> h3. 8. {{lookupFunctionType}} fan-out includes wasteful system-prefix entries
> For 1-part names the new code iterates {{resolutionCandidates(nameParts)}}
> and calls {{functionExists}} on every entry. The candidate list includes the
> {{{}system.session.<name>{}}}, {{{}system.builtin.<name>{}}},
> {{system.ai.<name>}} path entries – but routines under those namespaces are
> already resolved by {{lookupBuiltinOrTempFunction}} /
> {{lookupBuiltinOrTempTableFunction}} higher up. Calling {{functionExists}} on
> them wastes catalog calls (UC backends RPC) and is observable under tests
> that count calls.
> *Action:* filter the candidate list down to non-system entries before the
> {{functionExists}} walk:
> {code:scala}
> val persistentCandidates = resolutionCandidates(nameParts).filterNot { c =>
> c.length >= 2 &&
> c.head.equalsIgnoreCase(CatalogManager.SYSTEM_CATALOG_NAME) && {
> val ns = c(1)
> ns.equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)
> || ns.equalsIgnoreCase(CatalogManager.BUILTIN_NAMESPACE)
> || ns.equalsIgnoreCase(CatalogManager.AI_NAMESPACE)
> }
> }
> {code}
> h2. Medium
> h3. 9. Three near-duplicate path-resolution helpers should be lifted to
> {{CatalogManager}}
> {{RelationResolution.relationResolutionEntries}} and
> {{FunctionResolution.sqlResolutionPathEntriesForAnalysis}} each implement
> "view-aware ordered SQL path entries". They will drift. Procedure resolution
> in particular doesn't consult {{{}AnalysisContext.resolutionPathEntries{}}},
> so when #2 lands, {{CALL}} inside a view body will silently use the live
> session path.
> *Action:* add one helper on {{{}CatalogManager{}}}:
> {code:scala}
> def resolutionPathEntriesForAnalysis(
> pinnedEntries: Option[Seq[Seq[String]]],
> viewCatalogAndNamespace: Seq[String]): Seq[Seq[String]]
> {code}
> that picks pinned entries when set (and PATH is enabled) or falls back to
> {{sqlResolutionPathEntries}} with view-aware path/expand args. Route
> relation, function, and procedure resolution all through it.
> h3. 10. Tests for the new error paths and gates
> The PR added at least three new behaviour gates with no direct test coverage
> in {{{}SetPathSuite{}}}:
> * DECLARE / SET VAR / DROP VARIABLE cycle under {{SET PATH =
> somecat.somesch}} (positive case for DDL, negative case for unqualified DML
> lookup).
> * DESCRIBE TABLE error message format under the new {{QueryLike}} /
> {{UnresolvedTableOrViewSearchPathMode}} error path – verify the message lists
> the full PATH.
> * Unqualified variable resolution failure when {{system.session}} is not on
> PATH (e.g. PATH set, {{SELECT @x}} must fail; {{SELECT system.session.x}}
> must succeed).
> *Action:* add {{{}intercept[AnalysisException]{}}}-based tests for the above
> to {{{}SetPathSuite{}}}.
> h3. 11. {{ProtoToParsedPlanTestSuite.analyzerIsolationConf}} is a bare
> {{SQLConf}}
> File:
> {{sql/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala}}
> The PR introduced {{analyzerIsolationConf = new SQLConf()}} to insulate the
> suite from session SET PATH. That bare {{SQLConf}} discards every override
> the test session's {{sparkConf}} sets ({{{}ANSI_ENABLED{}}},
> {{{}ALWAYS_INLINE_COMMON_EXPR{}}}, {{{}USE_COMMON_EXPR_ID_FOR_ALIAS{}}}, ...)
> and exposes the {{SQLConf}} factory defaults. Each lost override is silently
> a coverage gap.
> *Action:* clone the shared session's conf instead, only overriding the knobs
> that genuinely need to differ:
> {code:scala}
> private lazy val analyzerIsolationConf: SQLConf = {
> val c = spark.sessionState.conf.clone()
> c.setConf(SQLConf.PATH_ENABLED, false)
> c
> }
> {code}
> This automatically picks up the test session's overrides and only the genuine
> isolation knob remains explicit.
> ALSO:
> Confirmed — {{ResolveSetVariable.scala:105}} hardcodes {{Seq("SYSTEM",
> "SESSION")}} as the search path it passes to {{{}unresolvedVariableError{}}},
> regardless of the actual SQL PATH. So the error message lies about what was
> searched. That's the same family of bug as Vladimir's #6 (DESCRIBE error
> format) — the error path doesn't get reformatted under the new {{QueryLike}}
> mode either.
> I'll add it to SPARK-56681 as a separate follow-up item and commit the test
> as-is (the hardcoded message means my {{checkError}} matches today, with the
> understanding that the OSS fix will require a corresponding {{parameters =
> Map(... "searchPath" -> "<actual path>")}} update here).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]