This is an automated email from the ASF dual-hosted git repository.
aglinxinyuan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/texera.git
The following commit(s) were added to refs/heads/main by this push:
new 71ed5aa04c fix: Iceberg warehouse path mismatch between Python and
Java/Scala catalogs (#4409)
71ed5aa04c is described below
commit 71ed5aa04ccd349651a3b5503fbd23cc42ab78ef
Author: Xinyuan Lin <[email protected]>
AuthorDate: Mon Apr 20 13:29:48 2026 -0700
fix: Iceberg warehouse path mismatch between Python and Java/Scala catalogs
(#4409)
### What changes were proposed in this PR?
Iceberg tables created via the Python API could not be read back on the
Java/Scala side because the two runtimes were registering the Postgres
JDBC catalog with different warehouse values, which PyIceberg persists
into the table metadata.
The Python side (create_postgres_catalog in
amber/src/main/python/core/storage/iceberg/iceberg_utils.py) was
prefixing the same path with file://, so tables created by Python UDFs
were registered under file:///... while Scala-side lookups expected the
un-prefixed path.
This caused subsequent reads of Python-written Iceberg tables to fail
(wrong/unresolvable warehouse path in the metadata pointer).
Drop the file:// prefix in create_postgres_catalog so Python matches the
Scala catalog's warehouse value exactly. PyIceberg accepts a plain local
path here and will treat it as a local filesystem warehouse, consistent
with the Scala JdbcCatalog configuration.
### Any related issues, documentation, discussions?
Closes #4408
### How was this PR tested?
Added a test case and tested manually:
1. Create an Iceberg table from a Python UDF operator and confirm it can
be read back from the Scala/Java engine in the same workflow.
2. Re-run existing Iceberg-backed workflows (Python-write → Python-read
and Python-write → Scala-read) and confirm no regressions.
3. Verify on Windows that the warehouse path passed in (with colon
stripped) still resolves correctly from Python.
### Was this PR authored or co-authored using generative AI tooling?
No.
---
.../python/core/storage/iceberg/iceberg_utils.py | 2 +-
.../storage/iceberg/test_iceberg_utils_catalog.py | 98 ++++++++++++++++++++++
2 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/amber/src/main/python/core/storage/iceberg/iceberg_utils.py
b/amber/src/main/python/core/storage/iceberg/iceberg_utils.py
index f973c72fe8..844ef3e00f 100644
--- a/amber/src/main/python/core/storage/iceberg/iceberg_utils.py
+++ b/amber/src/main/python/core/storage/iceberg/iceberg_utils.py
@@ -148,7 +148,7 @@ def create_postgres_catalog(
catalog_name,
**{
"uri":
f"postgresql+pg8000://{username}:{password}@{uri_without_scheme}",
- "warehouse": f"file://{warehouse_path}",
+ "warehouse": warehouse_path,
},
)
diff --git
a/amber/src/main/python/core/storage/iceberg/test_iceberg_utils_catalog.py
b/amber/src/main/python/core/storage/iceberg/test_iceberg_utils_catalog.py
new file mode 100644
index 0000000000..902829d44c
--- /dev/null
+++ b/amber/src/main/python/core/storage/iceberg/test_iceberg_utils_catalog.py
@@ -0,0 +1,98 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from unittest.mock import patch
+
+from core.storage.iceberg import iceberg_utils
+from core.storage.iceberg.iceberg_utils import create_postgres_catalog
+
+
+class TestCreatePostgresCatalog:
+ """
+ Regression tests for `create_postgres_catalog`.
+
+ The Scala side (`IcebergUtil.createPostgresCatalog`) initializes the JDBC
+ catalog with a plain filesystem warehouse path (no URI scheme). PyIceberg
+ persists the `warehouse` property into table metadata, so if the Python
+ side registers the catalog with a `file://`-prefixed value, Iceberg tables
+ written from Python UDFs become unreadable from the Scala/Java engine
+ (and vice versa). These tests pin the Python side to the same plain-path
+ convention used on the Scala side.
+ """
+
+ def test_warehouse_is_passed_without_file_scheme(self):
+ """`warehouse` must be forwarded as-is, without a `file://` prefix."""
+ warehouse_path = "/tmp/texera/iceberg-warehouse"
+
+ with patch.object(iceberg_utils, "SqlCatalog") as mock_sql_catalog:
+ create_postgres_catalog(
+ catalog_name="texera_iceberg",
+ warehouse_path=warehouse_path,
+ uri_without_scheme="localhost:5432/texera_iceberg_catalog",
+ username="texera",
+ password="password",
+ )
+
+ assert mock_sql_catalog.call_count == 1
+ _, kwargs = mock_sql_catalog.call_args
+ assert kwargs["warehouse"] == warehouse_path
+ assert not kwargs["warehouse"].startswith("file://")
+
+ def test_windows_style_warehouse_is_passed_verbatim(self):
+ """
+ The Scala side strips the Windows drive colon (e.g. `C:/x` -> `C/x`)
+ before registering the catalog so PyArrow can parse the path. The
+ Python side should forward whatever it receives verbatim, so the two
+ runtimes agree on the warehouse string stored in Iceberg metadata.
+ """
+ warehouse_path = "C/Users/texera/iceberg-warehouse"
+
+ with patch.object(iceberg_utils, "SqlCatalog") as mock_sql_catalog:
+ create_postgres_catalog(
+ catalog_name="texera_iceberg",
+ warehouse_path=warehouse_path,
+ uri_without_scheme="localhost:5432/texera_iceberg_catalog",
+ username="texera",
+ password="password",
+ )
+
+ _, kwargs = mock_sql_catalog.call_args
+ assert kwargs["warehouse"] == warehouse_path
+ assert "file://" not in kwargs["warehouse"]
+
+ def test_postgres_uri_is_built_with_pg8000_scheme(self):
+ """The JDBC URI should be prefixed with `postgresql+pg8000://` and
+ include credentials; nothing about that should bleed into `warehouse`.
+ """
+ warehouse_path = "/var/lib/texera/warehouse"
+
+ with patch.object(iceberg_utils, "SqlCatalog") as mock_sql_catalog:
+ create_postgres_catalog(
+ catalog_name="texera_iceberg",
+ warehouse_path=warehouse_path,
+ uri_without_scheme="db.internal:5432/texera_iceberg_catalog",
+ username="texera",
+ password="s3cret",
+ )
+
+ args, kwargs = mock_sql_catalog.call_args
+ assert args == ("texera_iceberg",)
+ assert kwargs["uri"] == (
+
"postgresql+pg8000://texera:[email protected]:5432/texera_iceberg_catalog"
+ )
+ # And warehouse is still the plain path.
+ assert kwargs["warehouse"] == warehouse_path