Re: [PR] [SPARK-56984][DOCS] Document the SQL PATH feature [spark]

via GitHub Fri, 22 May 2026 04:51:24 -0700


cloud-fan commented on code in PR #56040:
URL: https://github.com/apache/spark/pull/56040#discussion_r3288137139



##########
docs/sql-migration-guide.md:
##########
@@ -31,6 +31,9 @@ license: |
 - Since Spark 4.2, Spark enables order-independent checksums for shuffle 
outputs by default to detect data inconsistencies during indeterminate shuffle 
stage retries. If a checksum mismatch is detected, Spark rolls back and 
re-executes all succeeding stages that depend on the shuffle output. If rolling 
back is not possible for some succeeding stages, the job will fail. To restore 
the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` 
and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to 
`false`.
 - Since Spark 4.2, support for Derby JDBC datasource is deprecated.
 - Since Spark 4.2, a new default method `mergeWith` has been added to the 
`CustomTaskMetric` interface. The default implementation sums the two metric 
values, which is correct for count-type metrics. Data source connector 
implementations that report non-additive metrics (e.g., maximum, average, 
compression ratio, or gauge values) must override `mergeWith` to provide 
correct merge semantics.
+- Since Spark 4.2, the virtual `system` catalog hosts the new `system.builtin` 
and `system.session` namespaces. `system.builtin` exposes built-in functions 
and functions injected through `SparkSessionExtensions`; `system.session` 
exposes temporary views, temporary functions, and session variables created in 
the current session. As a result, 2-part references like `builtin.func()` and 
`session.func()` now follow a mini-path that tries the system namespace first 
and the current catalog second, so a persistent schema named `builtin` or 
`session` is no longer reached by `builtin.func()` / `session.func()` when the 
system namespace contains an object of the same name. To restore the previous 
behavior (current catalog first), set `spark.sql.legacy.persistentCatalogFirst` 
to `true`. Persistent schemas with these names are still allowed but should be 
reached with an explicit catalog prefix (for example, 
`spark_catalog.session.x`). See [Reserved system 
names](sql-ref-identifier.html#reserved-system-names).
+- Since Spark 4.2, `CREATE TEMPORARY VIEW`, `CREATE TEMPORARY FUNCTION`, and 
the corresponding `DROP` statements accept the `session` and `system.session` 
qualifiers on the object name (in addition to the previously supported 
unqualified form); for example, `CREATE TEMPORARY VIEW system.session.v AS ...` 
and `DROP TEMPORARY FUNCTION session.f` are now valid. Any other qualifier on a 
temporary object is rejected with `INVALID_TEMP_OBJ_QUALIFIER`.
+- Spark 4.2 introduces the SQL standard `PATH` feature: the `SET PATH` 
statement, the `current_path()` function, the path-based resolution of 
unqualified routines / tables / views, and the configurations 
`spark.sql.path.enabled` (default `false`) and `spark.sql.defaultPath`. The 
feature is opt-in; when `spark.sql.path.enabled` is `false`, unqualified 
resolution falls back to a fixed default path and `SET PATH` is rejected with 
`UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED`. See [SET 
PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html) and [Name 
Resolution](sql-ref-name-resolution.html).

Review Comment:
   Two small things on this bullet:
   
   1. It lists "unqualified routines / tables / views" but omits session 
variables, even though the SET PATH page documents them as a PATH consumer and 
`SetPathSuite` has a dedicated test ("unqualified SET VAR follows PATH").
   2. Stylistic: every other bullet in the 4.2 section opens with "Since Spark 
4.2,"; this one opens with "Spark 4.2 introduces...".
   
   ```suggestion
   - Since Spark 4.2, the SQL standard `PATH` feature is available: the `SET 
PATH` statement, the `current_path()` function, path-based resolution of 
unqualified routines, tables, views, and session variables, and the 
configurations `spark.sql.path.enabled` (default `false`) and 
`spark.sql.defaultPath`. The feature is opt-in; when `spark.sql.path.enabled` 
is `false`, unqualified resolution falls back to a fixed default path and `SET 
PATH` is rejected with `UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED`. See [SET 
PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html) and [Name 
Resolution](sql-ref-name-resolution.html).
   ```
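
To make the opt-in wording in this bullet concrete, here is a minimal Python sketch of the gate it describes. It is a toy model, not Spark code: the `Session` class, `SetPathDisabledError`, and the method names are hypothetical, and the "fixed default path" is assumed to be the built-in default from the SET PATH page (`system.builtin`, `system.session`, current schema). `spark.sql.defaultPath` is deliberately left out.

```python
class SetPathDisabledError(Exception):
    """Stand-in for UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED."""


class Session:
    """Toy model of a session; not Spark's implementation."""

    def __init__(self, path_enabled=False, current_schema="spark_catalog.default"):
        self.path_enabled = path_enabled  # models spark.sql.path.enabled
        self.current_schema = current_schema
        self._path = None  # None means no SET PATH has been issued yet

    def fixed_default_path(self):
        # Assumption: the fallback is the built-in default path.
        return ["system.builtin", "system.session", self.current_schema]

    def set_path(self, entries):
        # SET PATH is rejected outright while the feature is disabled.
        if not self.path_enabled:
            raise SetPathDisabledError("UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED")
        self._path = list(entries)

    def effective_path(self):
        # Disabled, or enabled but never set: fall back to the default path.
        if not self.path_enabled or self._path is None:
            return self.fixed_default_path()
        return self._path
```

With the flag off, `set_path` raises and `effective_path()` keeps returning the default; with the flag on, the list passed to `set_path` is what resolution would walk.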



##########
docs/sql-ref-syntax-ddl-create-sql-function.md:
##########
@@ -124,6 +140,15 @@ characteristic
   - [Aggregate functions](sql-ref-functions-builtin.md#aggregate-functions)
   - [Window functions](sql-ref-functions-builtin.md#analytic-window-functions)
   - [Ranking functions](sql-ref-functions-builtin.md#ranking-window-functions)
+
+  A persistent SQL UDF cannot reference temporary views, temporary functions, 
or session
+  variables.
+
+  The SQL Path in effect at `CREATE FUNCTION` time is captured into the 
function's metadata; the
+  body resolves against that frozen path on every invocation, not the 
invoker's current path.
+  `current_schema()` and `current_path()` inside the body still return the 
invoker's context.
+  Use [DESCRIBE FUNCTION EXTENDED](sql-ref-syntax-aux-describe-function.html) 
to inspect the
+  captured path. See [SET PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html).
   - Row producing functions such as `explode`

Review Comment:
   The new paragraphs (lines 143–151) are inserted inside the bulleted list of 
disallowed expression types, so the ``- Row producing functions such as `explode` `` 
bullet is now orphaned from the three bullets above (`Aggregate / Window / Ranking`). 
Kramdown/Jekyll will render this as two separate lists with body paragraphs 
between them, which isn't the intent. Suggest moving the new paragraphs after the 
full list ends, e.g. after line 152, before the existing `Within the body of 
the function you can refer to parameter...` sentence.



##########
docs/sql-ref-function-current-path.md:
##########
@@ -0,0 +1,85 @@
+---
+layout: global
+title: current_path function
+displayTitle: current_path function
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Returns the effective SQL Path for the current session as a comma-separated 
string of
+qualified namespace names. See [`SET 
PATH`](sql-ref-syntax-aux-conf-mgmt-set-path.html) for a
+description of what the path is, how to enable it, and how to change it, and
+[Name Resolution](sql-ref-name-resolution.html) for how the path drives 
unqualified name
+resolution.
+
+### Syntax
+
+```sql
+current_path()
+```
+
+### Arguments
+
+This function takes no arguments. The parentheses may be omitted.
+
+### Returns
+
+A non-nullable `STRING`. Each path entry is written as a dotted name with 
backticks added only
+where required by Spark's identifier rules. Entries are separated by a single 
comma.
+
+When the path contains the virtual `CURRENT_SCHEMA` marker, the marker is 
materialized as the
+catalog-qualified current schema (`current_catalog.current_schema`) each time
+`current_path()` is evaluated, so subsequent `USE SCHEMA` statements are 
reflected without
+re-issuing `SET PATH`.
+
+### Examples
+
+```sql
+> SELECT current_path();
+ system.builtin,system.session,spark_catalog.default
+
+-- ANSI no-parens form returns the same value.
+> SELECT CURRENT_PATH;
+ system.builtin,system.session,spark_catalog.default
+
+-- The output reflects the latest SET PATH.
+> SET PATH = spark_catalog.default, system.builtin;
+> SELECT current_path();
+ spark_catalog.default,system.builtin
+
+-- CURRENT_SCHEMA on the path is re-evaluated on every call.
+> SET PATH = CURRENT_SCHEMA, system.builtin;
+> USE spark_catalog.finance;
+> SELECT current_path();
+ spark_catalog.finance,system.builtin
+> USE spark_catalog.default;
+> SELECT current_path();
+ spark_catalog.default,system.builtin
+
+-- Inside a persisted view or SQL function body, current_path() returns the 
invoker's path,

Review Comment:
   `persisted` → `persistent`; everywhere else in this PR (`set-path.md:56`, 
`create-view.md`, `name-resolution.md:291`, `create-sql-function.md`) uses 
"persistent view".
   
   ```suggestion
   -- Inside a persistent view or SQL function body, current_path() returns the 
invoker's path,
   ```
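
For reference, the `CURRENT_SCHEMA` behaviour that the examples in this file rely on can be modelled in a few lines of Python. This is only a sketch of the documented semantics (the marker is materialized each time `current_path()` is evaluated), not Spark's implementation. The `CURRENT_SCHEMA` sentinel and the function signature are hypothetical, and backtick quoting of identifiers is ignored.

```python
CURRENT_SCHEMA = object()  # sentinel standing in for the virtual marker


def current_path(path, current_catalog, current_schema):
    """Render a stored path as a comma-separated string.

    The marker is expanded at call time, so a later USE SCHEMA changes
    the result without the stored path being rewritten.
    """
    rendered = []
    for entry in path:
        if entry is CURRENT_SCHEMA:
            rendered.append(f"{current_catalog}.{current_schema}")
        else:
            rendered.append(entry)
    return ",".join(rendered)


# Models: SET PATH = CURRENT_SCHEMA, system.builtin
stored_path = [CURRENT_SCHEMA, "system.builtin"]
```

Calling `current_path(stored_path, "spark_catalog", "finance")` and then `current_path(stored_path, "spark_catalog", "default")` mirrors the two `USE` steps in the documented example: the stored path is unchanged, only the rendering differs.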



##########
docs/sql-ref-name-resolution.md:
##########
@@ -256,37 +259,54 @@ This restriction also applies to parameter references in 
SQL functions.
   frm.a  lat.b  func.c
 ```
 
-## Table and view resolution
-
-An identifier in table-reference can be any one of the following:
+## Object name resolution
 
-- Persistent table or view
-- Common table expression (CTE)
-- [Temporary view](sql-ref-syntax-ddl-create-view.html)
+Tables, views, and functions follow the same resolution rule. It depends on 
how many parts the
+identifier has.
 
-Resolution of an identifier depends on whether it is qualified:
+### Fully qualified (3 parts) &mdash; `catalog.schema.object`
 
-- **Qualified**
+The reference is unique and is looked up in `catalog.schema`. 
`system.builtin.object` identifies
+a built-in function; `system.session.object` identifies a temporary view, 
function, or session
+variable.
 
-  If the identifier is fully qualified with three parts: 
`catalog.schema.relation`, it is unique.
+### Partially qualified (2 parts) &mdash; `schema.object`
 
-  If the identifier consists of two parts: `schema.relation`, it is further 
qualified with the result of `SELECT current_catalog()` to make it unique.
+The identifier is qualified with `current_catalog` &mdash; producing
+`current_catalog.schema.object` &mdash; unless the leading part is `session` 
(or `builtin`, for
+functions). In that case Spark uses the
+[mini-path](sql-ref-identifier.html#reserved-system-names) to choose the 
implicit catalog,

Review Comment:
   Link text says "mini-path" but the target section in 
`sql-ref-identifier.md#reserved-system-names` never uses or defines that term — 
a reader following the link won't find it. Either introduce the "mini-path" 
name in the Reserved system names section (e.g. as the term for the 
system-vs-current-catalog fallback) or rename the link text to something 
present in the target (e.g. "[the rules in Reserved system names]").
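
If the term is kept, a short definition could be backed by something like the following Python sketch of the fallback as the migration guide describes it: the system namespace is tried first and the current catalog second, with `spark.sql.legacy.persistentCatalogFirst` reversing the order. The function name, the tuple representation of objects, and the parameters are hypothetical. The sketch also ignores the documented detail that `builtin` is special only for functions.

```python
SYSTEM_NAMESPACES = {"builtin", "session"}


def resolve_two_part(schema, name, system_objects, catalog_objects,
                     current_catalog="spark_catalog",
                     persistent_catalog_first=False):
    """Return the 3-part name that `schema.name` resolves to, or None."""
    catalog_candidate = (current_catalog, schema, name)
    if schema not in SYSTEM_NAMESPACES:
        # Ordinary schema: only the current catalog is consulted.
        return catalog_candidate if catalog_candidate in catalog_objects else None

    system_candidate = ("system", schema, name)
    lookups = [(system_candidate, system_objects),
               (catalog_candidate, catalog_objects)]
    if persistent_catalog_first:
        # Models spark.sql.legacy.persistentCatalogFirst = true.
        lookups.reverse()
    for candidate, objects in lookups:
        if candidate in objects:
            return candidate
    return None
```

In this model, a persistent `session.f` is hidden only when `system.session` also contains an `f`; a persistent `session.g` with no temporary counterpart is still reached through the second step.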



##########
docs/sql-ref-name-resolution.md:
##########
@@ -420,4 +435,35 @@ If the function cannot be resolved Spark raises an 
`UNRESOLVED_ROUTINE` error.
 -- To resolve the persistent function it now needs qualification
 > SELECT spark_catalog.default.func(4, 3);
  6
+
+-- A built-in can always be reached by qualification, even when shadowed
+> CREATE TEMPORARY FUNCTION abs() RETURNS INT RETURN 999;
+> SELECT abs(-5);

Review Comment:
   The caption says a built-in "can always be reached by qualification, even 
when shadowed", but the temp function is created as zero-arg `abs()` while the 
unqualified call is the one-arg `abs(-5)`. With the default PATH 
(`system.builtin, system.session, current_schema`), `abs(-5)` resolves to the 
built-in because `system.builtin` precedes `system.session` in the path, not 
because qualification reaches around a shadow. The example does demonstrate the 
qualification rule (`session.abs()`, `builtin.abs(-5)`, and 
`system.builtin.abs(-5)` all appear), but the unqualified line doesn't actually 
illustrate shadowing.
   
   A matching `(x INT)` signature for the temp `abs` plus a PATH like `SET PATH 
= system.session, system.builtin` would let `SELECT abs(-5)` resolve to the 
temp (999), and the qualified `system.builtin.abs(-5)` would then meaningfully 
reach "around" the shadow.
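
The ordering argument can be checked with a small first-match-wins model in Python. This is a sketch under stated assumptions, not Spark's resolver: matching is by name only (signatures are ignored), and `resolve_function`, `namespaces`, and the two path lists are hypothetical names for illustration.

```python
def resolve_function(name, path, namespaces):
    """Return the first namespace on `path` that defines `name`, or None."""
    for namespace in path:
        if name in namespaces.get(namespace, set()):
            return namespace
    return None


# Both the built-in abs and a temporary abs exist in this model.
namespaces = {
    "system.builtin": {"abs"},
    "system.session": {"abs"},
}

# Default order: the built-in namespace comes first, so nothing is shadowed.
default_path = ["system.builtin", "system.session", "spark_catalog.default"]

# Session first: the temporary function now shadows the built-in.
session_first_path = ["system.session", "system.builtin"]
```

Under `default_path` the unqualified name lands on `system.builtin`; under `session_first_path` it lands on `system.session`, which is the case where a qualified `system.builtin.abs(-5)` would actually be reaching around a shadow.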



##########
docs/sql-ref-syntax-aux-describe-table.md:
##########
@@ -272,12 +276,88 @@ DESCRIBE customer salesdb.customer.name;
 +---------+----------+
 
 -- Returns the table metadata in JSON format.
+-- (Formatted for readability; the actual output is on a single line.)
 DESC FORMATTED customer AS JSON;
-{"table_name":"customer","catalog_name":"spark_catalog","schema_name":"default","namespace":["default"],"columns":[{"name":"cust_id","type":{"name":"integer"},"nullable":true},{"name":"name","type":{"name":"string"},"comment":"Short name","nullable":true},{"name":"state","type":{"name":"varchar","length":20},"nullable":true}],"location":"file:/tmp/salesdb.db/custom...","created_time":"2020-04-07T14:05:43Z","last_access":"UNKNOWN","created_by":"None","type":"MANAGED","provider":"parquet","partition_provider":"Catalog","partition_columns":["state"]}
+{
+  "table_name": "customer",
+  "catalog_name": "spark_catalog",
+  "schema_name": "default",
+  "namespace": ["default"],
+  "columns": [
+    {"name": "cust_id", "type": {"name": "integer"}, "nullable": true},
+    {"name": "name", "type": {"name": "string"}, "comment": "Short name", 
"nullable": true},
+    {"name": "state", "type": {"name": "varchar", "length": 20}, "nullable": 
true}
+  ],
+  "location": "file:/tmp/salesdb.db/custom...",
+  "created_time": "2020-04-07T14:05:43Z",
+  "last_access": "UNKNOWN",
+  "created_by": "None",
+  "type": "MANAGED",
+  "provider": "parquet",
+  "partition_provider": "Catalog",
+  "partition_columns": ["state"]
+}
+
+-- DESCRIBE EXTENDED on a view emits view-specific rows.
+SET PATH = spark_catalog.default, system.builtin;
+CREATE VIEW recent_customers AS
+    SELECT cust_id, name FROM customer WHERE cust_id > 1000;
+
+DESCRIBE EXTENDED recent_customers;
++----------------------------+---------------------------------------+--------+
+|                    col_name|                              data_type| comment|
++----------------------------+---------------------------------------+--------+
+|                     cust_id|                                    int|    null|
+|                        name|                                 string|    null|
+|                            |                                       |        |
+|# Detailed Table Information|                                       |        |
+|                    Catalog |                          spark_catalog|        |
+|                    Database|                                default|        |
+|                       Table|                       recent_customers|        |
+|                        Type|                                   VIEW|        |
+|                   View Text|SELECT cust_id, name FROM customer ... |        |
+|          View Original Text|SELECT cust_id, name FROM customer ... |        |
+|            View Schema Mode|                           COMPENSATION|        |
+| View Catalog and Namespace|                spark_catalog.default   |        |
+|   View Query Output Columns|                   [`cust_id`, `name`]  |        |
+|                    SQL Path|   spark_catalog.default, system.builtin|        |
++----------------------------+---------------------------------------+--------+
+
+-- The same metadata in JSON form.
+-- (Formatted for readability; the actual output is on a single line.)
+DESCRIBE EXTENDED recent_customers AS JSON;
+{
+  "table_name": "recent_customers",
+  "catalog_name": "spark_catalog",
+  "schema_name": "default",
+  "namespace": ["default"],
+  "columns": [
+    {"name": "cust_id", "type": {"name": "int"}, "nullable": true},

Review Comment:
   `cust_id` is shown as `"type": {"name": "integer"}` in the earlier `DESC 
FORMATTED customer AS JSON` block (line 287) but as `"type": {"name": "int"}` 
here, even though both reference the same column of the same `customer` table. 
Pick one spelling and use it in both examples, whichever Spark actually emits 
(`int` for the v2 JSON schema; `integer` was in the legacy example).



##########
docs/sql-ref-syntax-aux-conf-mgmt-set-path.md:
##########
@@ -0,0 +1,238 @@
+---
+layout: global
+title: SET PATH
+displayTitle: SET PATH
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+`SET PATH` changes the **SQL Path** of the current session.
+
+The SQL Path is an ordered list of catalog-qualified schema names that Spark 
walks when
+resolving unqualified references to functions, tables, views, and session 
variables in queries
+and DML (`SELECT`, `INSERT`, `UPDATE`, `DELETE`, `MERGE`). The first match 
wins. DDL
+(`CREATE TABLE`, `CREATE VIEW`, `CREATE FUNCTION`, `DROP`, `ALTER`, ...) 
resolves unqualified
+object names against `current_catalog.current_schema`, not the path; so 
`CREATE TABLE t` always
+creates `t` in the current schema regardless of the path.
+
+The path can include two virtual namespaces in the `system` catalog:
+
+- `system.builtin` &mdash; built-in functions, including those injected by
+  `SparkSessionExtensions`.
+- `system.session` &mdash; temporary views, temporary functions, and session 
variables in the
+  current session.
+
+`SET PATH` is controlled by `spark.sql.path.enabled`. When it is `false` (the 
default),
+`SET PATH` raises `UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED`. Unqualified 
resolution and
+[`current_path()`](sql-ref-function-current-path.html) still use the default 
path.
+
+The initial value of `PATH` in a session is `DEFAULT_PATH`. `DEFAULT_PATH` is 
either the value of
+`spark.sql.defaultPath`, or, when that configuration is empty, a built-in 
value composed of
+`system.builtin`, `system.session`, and the current schema. To override, set
+`spark.sql.defaultPath`. See the [`DEFAULT_PATH` parameter](#parameters) for 
the exact derivation
+rules.
+
+The effect of `SET PATH` is scoped to the current session and is lost when the 
session ends. To
+re-apply the current default path mid-session, run `SET PATH = DEFAULT_PATH`. 
(This stores a
+snapshot of `DEFAULT_PATH` at the moment of the statement; later changes to
+`spark.sql.defaultPath` are not picked up automatically.) Cloned sessions 
inherit the parent's
+path at clone time; later changes in the child do not propagate back.
+
+Persistent views and SQL UDFs capture the path at `CREATE` time into the 
object's metadata.
+Each invocation resolves the body against that frozen path, not the invoker's 
current path;
+`current_schema()` and `current_path()` inside the body still return the 
invoker's context.
+
+The leading names `session` and `builtin` have special meaning in 2-part 
references; see
+[Reserved system names](sql-ref-identifier.html#reserved-system-names).
+
+### Syntax
+
+```sql
+SET PATH = path_element [ , ... ]
+
+path_element
+    { DEFAULT_PATH |
+      SYSTEM_PATH |
+      PATH |
+      CURRENT_SCHEMA |
+      CURRENT_DATABASE |
+      catalog_name . schema_name }
+```
+
+### Parameters
+
+* **`DEFAULT_PATH`**
+
+  Expands to the session's default path. The default path has two layers:
+
+  1. If `spark.sql.defaultPath` is set to a non-empty value, that value is 
parsed using the same
+     grammar as `SET PATH` (with one restriction: the `PATH` keyword is not 
allowed inside the
+     conf value, since it would be self-referential).
+
+     The conf value is validated for syntax at the time it is set; an invalid 
value is rejected.
+     Static duplicates inside the conf are tolerated (unlike interactive `SET 
PATH`, which
+     rejects them) so a later `USE SCHEMA` cannot turn a previously valid 
default into a runtime
+     error. A `DEFAULT_PATH` token inside the conf value resolves to the 
spark-built-in default
+     below to avoid a cycle, rather than recursing.
+
+  2. If `spark.sql.defaultPath` is empty (the factory setting), the 
spark-built-in default
+     applies: `system.builtin`, `system.session`, and the current schema
+     (`current_catalog.current_schema`), in that order.
+
+  To change the default path, set `spark.sql.defaultPath` via any of the usual 
mechanisms
+  (`SET spark.sql.defaultPath = ...` at runtime, `--conf` on `spark-submit`, 
`SparkConf`, or
+  `spark-defaults.conf`); clear it with `RESET spark.sql.defaultPath` to 
return to the
+  spark-built-in default.
+
+* **`SYSTEM_PATH`**
+
+  Expands to the two system namespaces, `system.builtin` and `system.session`.
+
+* **`PATH`**
+
+  Expands to the **current** value of the SQL Path. Useful for appending 
entries without
+  re-typing them, for example `SET PATH = PATH, spark_catalog.analytics`.
+  `PATH` is not allowed in the value of `spark.sql.defaultPath` (it would 
create a cycle).
+
+* **`CURRENT_SCHEMA`** / **`CURRENT_DATABASE`**
+
+  A virtual marker that resolves to the catalog-qualified current schema
+  (`current_catalog.current_schema`) every time the path is consulted. This 
means subsequent
+  `USE SCHEMA` statements are picked up without re-issuing `SET PATH`.
+  `CURRENT_DATABASE` is a synonym for `CURRENT_SCHEMA`.
+
+* **`schema_name`**
+
+  An explicit catalog-qualified schema reference (`catalog.schema`). Both 
parts are required.

Review Comment:
   "Both parts are required" reads as "exactly two parts", but the grammar also 
accepts 3+ parts. `SetPathSuite` has a test "multi-level namespace (3+ parts) 
is accepted" (`SET PATH = spark_catalog.ns1.ns2, ...`), and the 
`INVALID_SQL_PATH_SCHEMA_REFERENCE` error itself reads `Use at least two name 
parts (catalog.schema); multi-level namespaces are allowed.` Suggest:
   
   ```suggestion
     An explicit catalog-qualified schema reference (`catalog.schema` or, for 
catalogs with multi-level namespaces, `catalog.ns1.ns2...`). At least two parts 
are required.
   ```
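
The "at least two parts" rule amounts to a check like the one below. This Python sketch is illustrative only: `split_schema_reference` is a hypothetical helper, it assumes plain unquoted identifiers (no backtick-quoted names containing dots), and it covers explicit schema references only, not keywords such as `CURRENT_SCHEMA`. The message text follows the `INVALID_SQL_PATH_SCHEMA_REFERENCE` wording quoted above.

```python
def split_schema_reference(element):
    """Split an explicit schema reference into its name parts.

    Assumes plain, unquoted identifiers; backtick quoting is not handled.
    """
    parts = element.split(".")
    # Fewer than two parts, or an empty part, is not a valid reference.
    if len(parts) < 2 or any(part == "" for part in parts):
        raise ValueError(
            "INVALID_SQL_PATH_SCHEMA_REFERENCE: use at least two name parts "
            "(catalog.schema); multi-level namespaces are allowed."
        )
    return parts
```

Both `spark_catalog.default` and `spark_catalog.ns1.ns2` pass this check, while a bare `default` is rejected, which is the distinction the current "Both parts are required" wording obscures.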



