srielau commented on code in PR #55717:
URL: https://github.com/apache/spark/pull/55717#discussion_r3196685813
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##########
@@ -113,17 +113,34 @@ class SessionCatalog(
identifier.copy(funcName = "") == SESSION_NAMESPACE_TEMPLATE
/**
- * Session function kinds in resolution order for unqualified lookups.
- * Matches [[SQLConf.sessionFunctionResolutionOrder]]: "first" (session
first),
- * "second" (default), "last" (builtin only; session tried after persistent).
+ * Provider for the ordered `system.builtin` / `system.session` entries on
the effective SQL
+ * PATH (mapped to [[SessionFunctionKind]]). Wired by
+ * [[org.apache.spark.sql.connector.catalog.CatalogManager]] after
construction so unqualified
+ * function lookups and the security check that blocks temp functions from
shadowing builtins
+ * read the live PATH (post-`SET PATH`, with [[SQLConf.DEFAULT_PATH]] and
+ * [[SQLConf.defaultPathOrder]] fallbacks already applied) instead of the
legacy
+ * [[SQLConf.SESSION_FUNCTION_RESOLUTION_ORDER]] proxy.
+ *
+ * The default reproduces the previous "second" behavior
(builtin-then-session) for tests
+ * that construct a [[SessionCatalog]] directly without going through
[[CatalogManager]].
*/
- private def sessionFunctionKindsInResolutionOrder: Seq[SessionFunctionKind]
= {
- conf.sessionFunctionResolutionOrder match {
- case "first" => Seq(Temp, Builtin)
- case "last" => Seq(Builtin)
- case _ => Seq(Builtin, Temp) // "second" (default)
- }
- }
+ private[sql] var sessionFunctionKindsProvider: () =>
Seq[SessionFunctionKind] =
Review Comment:
**Deadlock vector + publication concern on this `var`.**
*Deadlock.* `lookupBuiltinOrTempTableFunction` (line 2519, master) holds
`SessionCatalog.synchronized` and now reaches `sessionFunctionKindsProvider()`
→ `CatalogManager.leadingSystemFunctionKinds()` → `currentCatalog.name()` /
`sqlResolutionPathEntries` (`CatalogManager.synchronized`). Lock order: SC →
CM. Conversely, `CatalogManager.setCurrentNamespace`
(`CatalogManager.scala:159`) holds CM and calls
`v1SessionCatalog.setCurrentDatabaseWithNameCheck` → `synchronized { currentDb
= … }` (SC). Lock order: CM → SC. Classic AB-BA pattern. Pre-PR,
`sessionFunctionKindsInResolutionOrder` only read
`conf.sessionFunctionResolutionOrder` and never crossed into CM, so this hazard
is new.
*Publication.* The `var` is mutated from `CatalogManager`'s constructor body
without `@volatile`; under JMM, another thread that obtained a `SessionCatalog`
reference may observe the default callback even after the wiring statement runs.
Options: (a) compute the kinds outside the SC lock — e.g., drop
`synchronized` on `lookupBuiltinOrTempTableFunction` (it doesn't actually need
it given underlying registry lookups), or pre-compute kinds before calling
`lookupFunctionWithShadowing`; (b) replace the `var` with a back-reference to
`CatalogManager` passed at construction (eliminates publication, makes the
dependency explicit); (c) cache the kinds list inside `CatalogManager` so the
callback doesn't acquire CM in the hot path.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##########
@@ -113,17 +113,34 @@ class SessionCatalog(
identifier.copy(funcName = "") == SESSION_NAMESPACE_TEMPLATE
/**
- * Session function kinds in resolution order for unqualified lookups.
- * Matches [[SQLConf.sessionFunctionResolutionOrder]]: "first" (session
first),
- * "second" (default), "last" (builtin only; session tried after persistent).
+ * Provider for the ordered `system.builtin` / `system.session` entries on
the effective SQL
+ * PATH (mapped to [[SessionFunctionKind]]). Wired by
+ * [[org.apache.spark.sql.connector.catalog.CatalogManager]] after
construction so unqualified
+ * function lookups and the security check that blocks temp functions from
shadowing builtins
+ * read the live PATH (post-`SET PATH`, with [[SQLConf.DEFAULT_PATH]] and
+ * [[SQLConf.defaultPathOrder]] fallbacks already applied) instead of the
legacy
+ * [[SQLConf.SESSION_FUNCTION_RESOLUTION_ORDER]] proxy.
+ *
+ * The default reproduces the previous "second" behavior
(builtin-then-session) for tests
+ * that construct a [[SessionCatalog]] directly without going through
[[CatalogManager]].
*/
- private def sessionFunctionKindsInResolutionOrder: Seq[SessionFunctionKind]
= {
- conf.sessionFunctionResolutionOrder match {
- case "first" => Seq(Temp, Builtin)
- case "last" => Seq(Builtin)
- case _ => Seq(Builtin, Temp) // "second" (default)
- }
- }
+ private[sql] var sessionFunctionKindsProvider: () =>
Seq[SessionFunctionKind] =
Review Comment:
**Test default silently ignores `SESSION_FUNCTION_RESOLUTION_ORDER`.**
Pre-PR `sessionFunctionKindsInResolutionOrder` switched on the conf. Tests that
construct `SessionCatalog` directly without `CatalogManager` (`AnalysisTest`,
`SessionCatalogSuite`, `LookupFunctionsSuite`, `TableLookupCacheSuite`,
`PlanResolutionSuite`, …) now get `Seq(Builtin, Temp)` regardless of
`withSQLConf(SESSION_FUNCTION_RESOLUTION_ORDER → "first"|"last")`. Either
restore the conf-switch as the default so existing tests keep working
unchanged, or document the new contract loudly.
```suggestion
private[sql] var sessionFunctionKindsProvider: () =>
Seq[SessionFunctionKind] =
() => conf.sessionFunctionResolutionOrder match {
case "first" => Seq(Temp, Builtin)
case "last" => Seq(Builtin)
case _ => Seq(Builtin, Temp)
}
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala:
##########
@@ -52,6 +53,38 @@ class CatalogManager(
// TODO: create a real SYSTEM catalog to host `TempVariableManager` under
the SESSION namespace.
val tempVariableManager: TempVariableManager = new TempVariableManager
+ // Wire the path-driven kinds provider into SessionCatalog so the
function-resolution loop
+ // and the security check that blocks temp functions from shadowing builtins
read the live
+ // SQL PATH (post-`SET PATH`, with `DEFAULT_PATH` and `defaultPathOrder`
fallbacks already
+ // applied) instead of the legacy
[[SQLConf.SESSION_FUNCTION_RESOLUTION_ORDER]] proxy.
+ // Order is every `system.builtin` / `system.session` entry on the PATH in
path order
+ // (user-catalog segments in between are skipped for this list).
+ v1SessionCatalog.sessionFunctionKindsProvider = () =>
leadingSystemFunctionKinds()
+
+ /**
+ * Walk the effective SQL PATH and collect every system function namespace
entry
+ * (`system.builtin` / `system.session`) in path order, mapped to
+ * [[SessionCatalog.SessionFunctionKind]].
+ *
+ * User-catalog entries (e.g. `spark_catalog.default`) appear on the PATH
before
+ * `system.builtin` / `system.session`; they must not hide later system
entries.
+ * Unqualified builtin/temp resolution follows this ordered kind list via
+ *
[[org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunctionWithShadowing]].
+ *
+ * When no system entries appear on the PATH (user-defined path without
builtins/session),
+ * falls back to builtin-then-session so bare names still resolve like the
legacy default.
+ */
+ private def leadingSystemFunctionKinds():
Seq[SessionCatalog.SessionFunctionKind] = {
+ val path = sqlResolutionPathEntries(currentCatalog.name(),
currentNamespace.toSeq)
+ val kinds = path.flatMap { e =>
Review Comment:
**Behavior change in `"last"` mode that the PR description doesn't call
out.** `defaultPathOrder("last")` returns `[builtin, ..catalog.., session]`, so
this `flatMap` extracts `Seq(Builtin, Temp)`. Pre-PR
`sessionFunctionKindsInResolutionOrder` returned `Seq(Builtin)` for `"last"` —
session was deliberately excluded from the system-only shortcut.
Concrete consequence: `DESCRIBE FUNCTION foo` (`Analyzer.scala:2258`, via
`lookupBuiltinOrTempFunction`) in `"last"` mode now returns a temp `foo` even
when a persistent `foo` exists in the user catalog. Previously the shortcut
returned None and the rule fell through to persistent lookup. Same change
affects `Analyzer.LookupFunctions.quickCheck` (line 2186) and
`HigherOrderFunctionResolver` (line 109).
Either preserve `Seq(Builtin)` for `"last"` mode in the path-derived
computation (e.g., only collect kinds *up to the first user-catalog entry*, so
`[builtin, ..catalog.., session]` yields `Seq(Builtin)`), or document the
change and add an explicit test (`DESCRIBE FUNCTION foo` with both temp and
persistent in `"last"` mode).
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala:
##########
@@ -138,8 +171,39 @@ class CatalogManager(
private var _sessionPath: Option[Seq[SessionPathEntry]] = None
- /** Returns the raw stored session path entries, or None if no path is set.
*/
- def sessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
_sessionPath }
+ /**
+ * Returns the effective session path entries: the explicit `SET PATH` value
if stored,
+ * else the parsed [[SQLConf.DEFAULT_PATH]] conf if non-empty (mirroring how
+ * [[currentCatalog]] falls back to [[SQLConf.DEFAULT_CATALOG]]). Returns
`None` when
+ * [[SQLConf.PATH_ENABLED]] is false or both sources are empty.
+ */
+ def sessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
+ if (!conf.pathEnabled) None
+ else _sessionPath.orElse(workspaceDefaultPathEntries)
+ }
+
+ /** Raw `_sessionPath` (post-`SET PATH`), without the
[[SQLConf.DEFAULT_PATH]] fallback. */
+ def storedSessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
_sessionPath }
+
+ /**
+ * Parsed [[SQLConf.DEFAULT_PATH]] value, or `None` when the conf is empty /
blank. Reuses
+ * the SET PATH grammar via [[CatalystSqlParser.parseSqlPathElements]] so
the workspace
+ * default accepts the same syntax as `SET PATH = ...`.
+ *
+ * An inner `DEFAULT_PATH` token inside the conf value resolves to the
spark-builtin
+ * default ordering rather than recursing back into this method (cycle
break).
+ */
+ def workspaceDefaultPathEntries: Option[Seq[SessionPathEntry]] = {
Review Comment:
**Naming: `workspace*`.** "Workspace" isn't a Spark/OSS concept; the conf is
`spark.sql.defaultPath`, not `spark.sql.workspaceDefaultPath`. Suggest
`confDefaultPathEntries` / `isConfDefaultExpansion` (or
`defaultPathConfEntries`) — applies here, on
`_sessionPath.orElse(workspaceDefaultPathEntries)` (line 182), on the parameter
`isWorkspaceDefaultExpansion` in `PathElement.expand`, and inside
`PathElement.scala`'s Scaladoc.
Also: this method doesn't apply the duplicate-entry rejection that
`SetPathCommand.run` does (`SetPathCommand.scala:51-56`), so `DEFAULT_PATH =
"system.builtin, system.builtin"` is silently accepted while the equivalent
`SET PATH` is rejected. Easiest fix: pull the dedup into `PathElement.expand`
(or a small helper) and run it on both paths.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala:
##########
@@ -52,6 +53,38 @@ class CatalogManager(
// TODO: create a real SYSTEM catalog to host `TempVariableManager` under
the SESSION namespace.
val tempVariableManager: TempVariableManager = new TempVariableManager
+ // Wire the path-driven kinds provider into SessionCatalog so the
function-resolution loop
+ // and the security check that blocks temp functions from shadowing builtins
read the live
+ // SQL PATH (post-`SET PATH`, with `DEFAULT_PATH` and `defaultPathOrder`
fallbacks already
+ // applied) instead of the legacy
[[SQLConf.SESSION_FUNCTION_RESOLUTION_ORDER]] proxy.
+ // Order is every `system.builtin` / `system.session` entry on the PATH in
path order
+ // (user-catalog segments in between are skipped for this list).
+ v1SessionCatalog.sessionFunctionKindsProvider = () =>
leadingSystemFunctionKinds()
+
+ /**
+ * Walk the effective SQL PATH and collect every system function namespace
entry
+ * (`system.builtin` / `system.session`) in path order, mapped to
+ * [[SessionCatalog.SessionFunctionKind]].
+ *
+ * User-catalog entries (e.g. `spark_catalog.default`) appear on the PATH
before
+ * `system.builtin` / `system.session`; they must not hide later system
entries.
+ * Unqualified builtin/temp resolution follows this ordered kind list via
+ *
[[org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunctionWithShadowing]].
+ *
+ * When no system entries appear on the PATH (user-defined path without
builtins/session),
+ * falls back to builtin-then-session so bare names still resolve like the
legacy default.
+ */
+ private def leadingSystemFunctionKinds():
Seq[SessionCatalog.SessionFunctionKind] = {
Review Comment:
**Coordination layer doing strategy work** (open to discussion, not
blocking). Per Spark's name-resolution layer model, `CatalogManager`
(Coordination) shouldn't *generate resolution candidates or decide resolution
order* — that's `FunctionResolution`'s (Strategy) job.
`leadingSystemFunctionKinds` and `isSessionBeforeBuiltinInPath` derive
resolution order from the path.
The pre-PR site of this logic was inside `SessionCatalog`, which was also
wrong by the same model — so this PR moves the violation rather than resolving
it. Cleanest home would be `FunctionResolution` (it already consumes
`sqlResolutionPathEntries` via `sqlResolutionPathEntriesForAnalysis`, and the
lookup shortcuts in `SessionCatalog` would just receive a kinds list as a
parameter or via a thin helper). Curious if this was considered.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala:
##########
@@ -138,8 +171,39 @@ class CatalogManager(
private var _sessionPath: Option[Seq[SessionPathEntry]] = None
- /** Returns the raw stored session path entries, or None if no path is set.
*/
- def sessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
_sessionPath }
+ /**
+ * Returns the effective session path entries: the explicit `SET PATH` value
if stored,
+ * else the parsed [[SQLConf.DEFAULT_PATH]] conf if non-empty (mirroring how
+ * [[currentCatalog]] falls back to [[SQLConf.DEFAULT_CATALOG]]). Returns
`None` when
+ * [[SQLConf.PATH_ENABLED]] is false or both sources are empty.
+ */
+ def sessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
+ if (!conf.pathEnabled) None
+ else _sessionPath.orElse(workspaceDefaultPathEntries)
+ }
+
+ /** Raw `_sessionPath` (post-`SET PATH`), without the
[[SQLConf.DEFAULT_PATH]] fallback. */
+ def storedSessionPathEntries: Option[Seq[SessionPathEntry]] = synchronized {
_sessionPath }
+
+ /**
+ * Parsed [[SQLConf.DEFAULT_PATH]] value, or `None` when the conf is empty /
blank. Reuses
+ * the SET PATH grammar via [[CatalystSqlParser.parseSqlPathElements]] so
the workspace
+ * default accepts the same syntax as `SET PATH = ...`.
+ *
+ * An inner `DEFAULT_PATH` token inside the conf value resolves to the
spark-builtin
+ * default ordering rather than recursing back into this method (cycle
break).
+ */
+ def workspaceDefaultPathEntries: Option[Seq[SessionPathEntry]] = {
+ val confValue = conf.defaultPath
+ if (confValue == null || confValue.trim.isEmpty) {
+ None
+ } else {
+ val elements = CatalystSqlParser.parseSqlPathElements(confValue)
Review Comment:
**Hot-path re-parse + parsing inside `CatalogManager.synchronized`.** This
method runs on every `sessionPathEntries` call when `_sessionPath` is None and
`PATH_ENABLED` is true — every unqualified function/table resolution, every
`count(*)→count(1)` gate via `isSessionBeforeBuiltinInPath`, every
`currentPathString`. With `DEFAULT_PATH` set, that's an ANTLR parse +
`PathElement.expand` per call, all while `sessionPathEntries` (line 180) holds
the CatalogManager lock.
Cache the parsed result keyed on the conf string — invalidate on string
change. Even a simple `@volatile var cachedDefaultPath: Option[(String,
Option[Seq[SessionPathEntry]])]` (read outside the synchronized block, written
under it) would eliminate the per-call parse cost in the steady state and move
parsing out of the hot lock.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]