Re: [PR] test(amber): move state-mat e2e into amber-integration [texera]

via GitHub Sun, 14 Jun 2026 15:42:22 -0700


Copilot commented on code in PR #5682:
URL: https://github.com/apache/texera/pull/5682#discussion_r3410376556



##########
amber/src/test/python/core/architecture/packaging/test_state_materialization_e2e.py:
##########
@@ -68,192 +74,195 @@
 )
 
 
-# Module-level scratch dir for the sqlite catalog + iceberg warehouse.
-# We don't initialize `StorageConfig` here: other test modules (e.g.
-# test_iceberg_document.py) also call `StorageConfig.initialize` at
-# import time, and the class rejects re-initialization with
-# RuntimeError. Whichever module gets collected first wins; we adopt
-# its namespaces below.
-_WAREHOUSE_DIR = tempfile.mkdtemp(prefix="texera-state-e2e-warehouse-")
[email protected]
+class TestStateMaterializationE2E:
+    @pytest.fixture(autouse=True)
+    def _init_storage_config(self):
+        """Initialize StorageConfig + IcebergCatalogInstance for the real
+        postgres-backed catalog in the `amber-integration` CI job.
 
+        Critical detail: the Scala integration tests that run earlier in
+        the same job connect to the iceberg catalog DB as user
+        `postgres/postgres` (the storage.conf default for
+        `STORAGE_ICEBERG_CATALOG_POSTGRES_USERNAME/PASSWORD`). pyiceberg
+        creates the catalog's `iceberg_tables` metadata table on first
+        use, owned by whoever wrote first — so it ends up owned by
+        `postgres`. We MUST connect as the same user, otherwise we hit
+        `permission denied for table iceberg_tables`.
 
[email protected](scope="module", autouse=True)
-def sqlite_iceberg_catalog():
-    """Inject a sqlite-backed SqlCatalog so the test runs without external
-    iceberg infra (postgres/minio).
+        Why the reset: `test_iceberg_document.py` also calls
+        `StorageConfig.initialize` at module import time (with a
+        different `texera/password` user that works for it because no
+        Scala writes first in the `pyamber` job where it runs). pytest
+        imports every test module during collection, even ones whose
+        tests will be deselected by `-m integration`, so that
+        initialization happens here too. We force-reset the singletons
+        and re-init with the prod-correct credentials; safe because
+        test_iceberg_document's tests are deselected from this run.
 
-    Module-scoped so all tests in this file share one warehouse, and so
-    namespace creation only happens once. We save/restore the original
-    `IcebergCatalogInstance` singleton so other test modules that expect
-    a real postgres-backed catalog (e.g. test_iceberg_document.py) are
-    not affected by our replacement.
-    """
-    # Some other test module may have initialized StorageConfig already
-    # (it has a single-init lock). If nothing has initialized it yet,
-    # do it here with arbitrary values -- we replace the catalog
-    # instance below so the postgres/rest fields are never exercised.
-    if not StorageConfig._initialized:
+        Credentials read the same `STORAGE_ICEBERG_CATALOG_POSTGRES_*`
+        env vars the production code consumes (via storage.conf), so
+        the test matches whichever identity the Scala side uses in the
+        same job. Defaults to `postgres/postgres` (the storage.conf
+        default that `amber-integration` runs with) when unset.
+        """
+        StorageConfig._initialized = False
+        IcebergCatalogInstance._instance = None
         StorageConfig.initialize(
             catalog_type="postgres",
-            postgres_uri_without_scheme="unused",
-            postgres_username="unused",
-            postgres_password="unused",
-            rest_catalog_uri="unused",
-            rest_catalog_warehouse_name="unused",
+            postgres_uri_without_scheme=os.environ.get(
+                "STORAGE_ICEBERG_CATALOG_POSTGRES_URI_WITHOUT_SCHEME",
+                "localhost:5432/texera_iceberg_catalog",
+            ),
+            postgres_username=os.environ.get(
+                "STORAGE_ICEBERG_CATALOG_POSTGRES_USERNAME", "postgres"
+            ),
+            postgres_password=os.environ.get(
+                "STORAGE_ICEBERG_CATALOG_POSTGRES_PASSWORD", "postgres"
+            ),
+            rest_catalog_uri="http://localhost:8181/catalog/";,
+            rest_catalog_warehouse_name="texera",
             table_result_namespace="operator-port-result",
             table_state_namespace="operator-port-state",
-            directory_path=_WAREHOUSE_DIR,
+            
directory_path=tempfile.mkdtemp(prefix="texera-state-e2e-warehouse-"),
             commit_batch_size=4096,
-            s3_endpoint="unused",
-            s3_region="unused",
-            s3_auth_username="unused",
-            s3_auth_password="unused",
+            s3_endpoint="http://localhost:9000";,
+            s3_region="us-east-1",
+            s3_auth_username="minioadmin",
+            s3_auth_password="minioadmin",

Review Comment:
   S3 settings here are hardcoded to `minioadmin/minioadmin` and `us-east-1`, 
but the integration job provisions MinIO using the storage.conf defaults 
(`texera_minio` / `password`, region `us-west-2`). If any part of the 
state-materialization path touches S3 (e.g., large-binaries manager), these 
credentials/region will mismatch the job’s infra and can cause CI-only 
failures. Read the same `STORAGE_S3_*` env vars that storage.conf uses (with 
defaults matching storage.conf) to keep the test aligned with the job 
configuration.



##########
amber/src/test/python/core/architecture/packaging/test_state_materialization_e2e.py:
##########
@@ -68,192 +74,195 @@
 )
 
 
-# Module-level scratch dir for the sqlite catalog + iceberg warehouse.
-# We don't initialize `StorageConfig` here: other test modules (e.g.
-# test_iceberg_document.py) also call `StorageConfig.initialize` at
-# import time, and the class rejects re-initialization with
-# RuntimeError. Whichever module gets collected first wins; we adopt
-# its namespaces below.
-_WAREHOUSE_DIR = tempfile.mkdtemp(prefix="texera-state-e2e-warehouse-")
[email protected]
+class TestStateMaterializationE2E:
+    @pytest.fixture(autouse=True)
+    def _init_storage_config(self):

Review Comment:
   `_init_storage_config` is function-scoped (default), so it force-resets the 
global StorageConfig/IcebergCatalogInstance and creates a new temp warehouse 
directory for every test method. Since these are global singletons, repeated 
reinitialization is unnecessary overhead and increases the chance of state 
leaking/flakiness across tests; a class-scoped fixture is sufficient for this 
file.



##########
.github/workflows/build.yml:
##########
@@ -316,25 +316,19 @@ jobs:
     # license check stay in `amber`; this job is tests-only.
     if: ${{ inputs.run_amber_integration }}
     strategy:
+      # macOS provisions postgres / minio / lakekeeper natively because
+      # GitHub-hosted macOS runners have no Docker (and `services:`
+      # containers are Linux-only). Each docker-dependent step below
+      # branches on $RUNNER_OS inside its `run:` script: Linux keeps
+      # the docker image, macOS uses brew + the upstream
+      # aarch64-apple-darwin lakekeeper tarball.
       matrix:
-        os: [ubuntu-22.04]
+        os: [ubuntu-22.04, macos-latest]
         java-version: [17]

Review Comment:
   The PR description says moving these e2e tests into the existing Ubuntu 
`amber-integration` job avoids paying for a separate macOS runner, but this 
change adds `macos-latest` to the `amber-integration` matrix. That will run the 
full Scala + Python integration suite on macOS and reintroduces macOS runner 
cost/time (likely more than the deleted diagnostic job). If macOS coverage is 
not a goal here, keep this job Ubuntu-only.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] test(amber): move state-mat e2e into amber-integration [texera]

Reply via email to