mengw15 opened a new issue, #5135: URL: https://github.com/apache/texera/issues/5135
### Feature Summary

Today Texera writes all execution outputs (`results`, `runtime_stats`, `console_logs`) into a single **global Iceberg warehouse**; see [`IcebergCatalogInstance.scala`](https://github.com/apache/texera/blob/a820f6727/common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/IcebergCatalogInstance.scala#L32), a process-wide singleton. All users share that one warehouse, and the platform absorbs the storage cost.

This issue proposes a **per-user warehouse** model: each user registers one or more warehouses, each backed by **their own S3 bucket** (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.

### Background / Motivation

- **Billing.** S3 cost should be attributed to the user who owns the data, not the platform.
- **Isolation.** Each tenant gets its own namespaces and tables, with no shared blast radius.
- **Builds on #4126** (Migrate to Result Service and MinIO for Execution Results). That issue introduced the REST Catalog Service (Lakekeeper) layer. This issue is the next step: make Lakekeeper multi-tenant.

### Scope

Per-user warehouses are scoped to the **Kubernetes deployment**. Local / single-node Docker Compose deployments continue to work as today: `PsqlCatalog` remains supported and unchanged, and `RestCatalog` mode keeps its current single global Lakekeeper warehouse (no per-user split).

### Proposed Solution or Design

#### Data model

```
User          ─1:N→ Warehouse      (new)
User          ─1:N→ ComputingUnit  (existing)
ComputingUnit ─1:N→ Execution      (existing)
Warehouse     ─1:N→ Execution      (new association)
```

**ER diagram:** *(to be added)*

#### Catalog hierarchy

Promote `Catalog` from a process-wide singleton to a per-user/per-CU (Computing Unit) interface.
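
To make the singleton-to-per-CU change concrete, here is a minimal sketch in Python (chosen only for brevity; the real change would live in the Scala backend). `CatalogConfig`, `CatalogRegistry`, the endpoint URL, and the warehouse identifiers are all hypothetical names invented for illustration, not existing Texera code. The idea is simply that catalog settings are looked up per computing unit rather than read from one global instance.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CatalogConfig:
    """Connection settings for one REST catalog (hypothetical type)."""
    uri: str        # REST catalog endpoint, e.g. a Lakekeeper deployment
    warehouse: str  # identifier of the warehouse registered in the catalog service

    def to_properties(self) -> Dict[str, str]:
        # "type", "uri" and "warehouse" are property keys commonly accepted by
        # Iceberg catalog loaders; treat the exact set as an assumption.
        return {"type": "rest", "uri": self.uri, "warehouse": self.warehouse}


class CatalogRegistry:
    """Maps each computing unit to its own catalog settings (no global state)."""

    def __init__(self) -> None:
        self._by_cu: Dict[int, CatalogConfig] = {}

    def bind(self, cu_id: int, config: CatalogConfig) -> None:
        # Called once, when the computing unit is created.
        self._by_cu[cu_id] = config

    def for_cu(self, cu_id: int) -> CatalogConfig:
        # Fail loudly rather than silently falling back to a shared warehouse.
        try:
            return self._by_cu[cu_id]
        except KeyError:
            raise LookupError(f"no warehouse bound to computing unit {cu_id}") from None


registry = CatalogRegistry()
registry.bind(1, CatalogConfig("http://lakekeeper.example/catalog", "warehouse-of-alice"))
registry.bind(2, CatalogConfig("http://lakekeeper.example/catalog", "warehouse-of-bob"))
```

Raising on an unbound computing unit, instead of defaulting to a global warehouse, is one way to keep results from leaking into shared storage; whether that is the desired behavior is a design decision this sketch does not settle.
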
Texera already has two `Catalog` implementations:

```
Catalog (interface)
├── PsqlCatalog   (backed by PostgreSQL)
└── RestCatalog   (backed by any Iceberg REST Catalog service)
```

This design uses **`RestCatalog` with Lakekeeper** as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB; **Texera never persists raw S3 creds**, only the Lakekeeper warehouse UUID and non-secret metadata.

#### Flow A: Registering a warehouse

1. User fills in the new Dashboard "Warehouse" tab with S3 bucket / endpoint / region / credentials.
2. Backend posts the credentials directly to Lakekeeper to create the warehouse. **Creds never touch the Texera DB.**
3. Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.

**Sequence diagram:** *(to be added)*

#### Flow B: Binding a warehouse to a CU

1. When the user creates a CU, they pick which warehouse to use.
2. Texera instantiates a `RestCatalog` for that CU using the chosen warehouse's Lakekeeper UUID, with no global singleton on the hot path.

Execution-time read/write paths (catalog ops via Lakekeeper, direct Parquet IO to the user's S3 bucket using Lakekeeper-vended credentials) follow the model in #4126, now scoped to the user's warehouse instead of the global one.

**Sequence diagram (CU creation + RestCatalog instantiation):** *(to be added)*

### Open questions

- Should a user own multiple warehouses, or exactly one? (Schema allows many; the picker UX may want a single primary.)
- Shared workflows: when User A runs a workflow owned by User B, whose warehouse stores the results?
- Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave S3 data orphaned in the user's bucket (Texera has no write access to user buckets), or soft-archive the catalog so existing executions stay readable until the user explicitly purges?
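
As an illustration of Flow A, the registration steps could be sketched as below. This is a minimal, hedged sketch in Python: `FakeLakekeeper`, `create_warehouse`, `WarehouseRecord`, and `register_warehouse` are hypothetical names, and the in-memory fake does not reflect Lakekeeper's actual API. It only shows the intended split: credentials are forwarded to the catalog service, while Texera keeps the returned UUID and non-secret metadata.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class S3Credentials:
    access_key_id: str
    secret_access_key: str


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera would persist: a reference plus non-secret metadata only."""
    owner_uid: int
    lakekeeper_warehouse_id: str
    bucket: str
    endpoint: str
    region: str


class FakeLakekeeper:
    """In-memory stand-in for the catalog service (not Lakekeeper's real API)."""

    def __init__(self) -> None:
        self._secrets: Dict[str, S3Credentials] = {}

    def create_warehouse(self, bucket: str, endpoint: str, region: str,
                         creds: S3Credentials) -> str:
        warehouse_id = str(uuid.uuid4())
        self._secrets[warehouse_id] = creds  # credentials stay on this side
        return warehouse_id


def register_warehouse(lakekeeper: FakeLakekeeper, texera_db: List[WarehouseRecord],
                       owner_uid: int, bucket: str, endpoint: str, region: str,
                       creds: S3Credentials) -> WarehouseRecord:
    # Step 2: forward the credentials; they are never written to texera_db.
    warehouse_id = lakekeeper.create_warehouse(bucket, endpoint, region, creds)
    # Step 3: store only the returned UUID and non-secret metadata.
    record = WarehouseRecord(owner_uid, warehouse_id, bucket, endpoint, region)
    texera_db.append(record)
    return record


texera_db: List[WarehouseRecord] = []
record = register_warehouse(
    FakeLakekeeper(), texera_db, owner_uid=7, bucket="alice-results",
    endpoint="https://s3.example.com", region="us-west-2",
    creds=S3Credentials("EXAMPLE-KEY-ID", "not-a-real-secret"),
)
```

Because `WarehouseRecord` has no credential fields at all, the "creds never touch the Texera DB" rule is enforced by the type rather than by convention.
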

### Sub-tasks

- [ ] Schema: add a `user_warehouse` table
- [ ] Backend: REST endpoints for warehouse CRUD; Lakekeeper client wrapper
- [ ] Backend: `Catalog` interface + `RestCatalog` impl; remove singleton assumption in `IcebergCatalogInstance`
- [ ] Backend: thread per-CU `Catalog` through to executions (no global state)
- [ ] CU plumbing: accept warehouse choice at CU creation; propagate to the worker
- [ ] Frontend: new "Warehouse" dashboard tab
- [ ] Frontend: warehouse picker in the CU creation flow
- [ ] Docs: README + dashboard help text on BYO-S3
