mengw15 opened a new issue, #5135:
URL: https://github.com/apache/texera/issues/5135

   ### Feature Summary
   
   Today Texera writes all execution outputs (`results`, `runtime_stats`, 
`console_logs`) into a single **global Iceberg warehouse** — see 
[`IcebergCatalogInstance.scala`](https://github.com/apache/texera/blob/a820f6727/common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/IcebergCatalogInstance.scala#L32),
 a process-wide singleton. All users share this one warehouse, and the platform 
absorbs the storage cost.
   
   This issue proposes a **per-user warehouse** model: each user registers one 
or more warehouses, each backed by **their own S3 bucket** (Bring-Your-Own-S3). 
Storage cost follows the data owner; users get tenant-isolated namespaces and 
tables.
   
   ### Background / Motivation
   
   - **Billing.** S3 cost should be attributed to the user who owns the data, 
not the platform.
   - **Isolation.** Per-tenant namespaces/tables, no shared blast radius.
   - **Builds on #4126** (Migrate to Result Service and MinIO for Execution 
Results) — that issue introduced the REST Catalog Service (Lakekeeper) layer. 
This issue is the next step: make Lakekeeper multi-tenant.
   
   ### Scope
   
   Per-user warehouses are scoped to the **Kubernetes deployment**. Local / 
single-node Docker Compose deployments continue to work as today: `PsqlCatalog` 
remains supported and unchanged, and `RestCatalog` mode keeps its current 
single global Lakekeeper warehouse (no per-user split).
   
   ### Proposed Solution or Design
   
   #### Data model
   
   ```
   User ─1:N→ Warehouse                (new)
   User ─1:N→ ComputingUnit            (existing)
   ComputingUnit ─1:N→ Execution       (existing)
   Warehouse ─1:N→ Execution           (new association)
   ```
   
   **ER diagram:** *(to be added)*
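
As a rough illustration of the relationships above, here is a minimal Python sketch using plain dataclasses. Every field name (`wid`, `owner_uid`, `lakekeeper_uuid`, `cuid`, `eid`) is hypothetical and not taken from the Texera schema; only the four relationships come from the data model in this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Warehouse:
    wid: int              # hypothetical Texera-side primary key
    owner_uid: int        # User 1:N Warehouse (new)
    lakekeeper_uuid: str  # reference to the Lakekeeper warehouse; no secrets


@dataclass(frozen=True)
class ComputingUnit:
    cuid: int
    owner_uid: int        # User 1:N ComputingUnit (existing)


@dataclass(frozen=True)
class Execution:
    eid: int
    cuid: int             # ComputingUnit 1:N Execution (existing)
    wid: int              # Warehouse 1:N Execution (new association)


def executions_in(warehouse: Warehouse, executions: List[Execution]) -> List[int]:
    """Return the ids of executions whose outputs live in the given warehouse."""
    return [e.eid for e in executions if e.wid == warehouse.wid]


alice_wh = Warehouse(wid=1, owner_uid=7, lakekeeper_uuid="example-uuid")
cu = ComputingUnit(cuid=10, owner_uid=7)
runs = [
    Execution(eid=100, cuid=cu.cuid, wid=1),
    Execution(eid=101, cuid=cu.cuid, wid=2),
]
print(executions_in(alice_wh, runs))
```

Storing `wid` on each execution, rather than deriving it from the CU, is one possible choice; it would keep old executions resolvable even if a CU were later rebound or deleted.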
   
   #### Catalog hierarchy
   
   Promote `Catalog` from a process-wide singleton to an interface that is 
instantiated per user and per computing unit (CU). Texera already has two 
`Catalog` implementations:
   
   ```
   Catalog (interface)
   ├── PsqlCatalog          — backed by PostgreSQL
   └── RestCatalog          — backed by any Iceberg REST Catalog service
   ```
   
   This design uses **`RestCatalog` with Lakekeeper** as the REST Catalog 
service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its 
own encrypted DB; **Texera never persists raw S3 creds**, only the Lakekeeper 
warehouse UUID and non-secret metadata.
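
To make the move away from a singleton concrete, the following Python sketch keeps one catalog handle per CU instead of one per process. `RestCatalogHandle` and `CatalogRegistry` are hypothetical names used only for illustration; the handle is a stand-in for a real `RestCatalog` client and carries nothing but the warehouse UUID.

```python
from typing import Callable, Dict


class RestCatalogHandle:
    """Stand-in for a RestCatalog client; holds only the warehouse reference."""

    def __init__(self, warehouse_uuid: str) -> None:
        self.warehouse_uuid = warehouse_uuid


class CatalogRegistry:
    """Maps each CU id to its own catalog handle (no process-wide instance)."""

    def __init__(self, factory: Callable[[str], RestCatalogHandle]) -> None:
        self._factory = factory
        self._by_cu: Dict[int, RestCatalogHandle] = {}

    def bind(self, cuid: int, warehouse_uuid: str) -> RestCatalogHandle:
        # A CU is bound once, at creation time; rebinding is rejected here.
        if cuid in self._by_cu:
            raise ValueError(f"CU {cuid} is already bound to a warehouse")
        self._by_cu[cuid] = self._factory(warehouse_uuid)
        return self._by_cu[cuid]

    def for_cu(self, cuid: int) -> RestCatalogHandle:
        # Raises KeyError for a CU that was never bound.
        return self._by_cu[cuid]


registry = CatalogRegistry(RestCatalogHandle)
registry.bind(1, "uuid-a")
registry.bind(2, "uuid-b")
```

Rejecting a second `bind` for the same CU is an assumption of this sketch; the issue does not say whether a CU can switch warehouses after creation.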
   
   #### Flow A — Registering a warehouse
   
   1. The user fills in the new Dashboard "Warehouse" tab with the S3 bucket, 
endpoint, region, and credentials.
   2. The backend posts the credentials directly to Lakekeeper to create the 
warehouse. **Credentials never touch the Texera DB.**
   3. Lakekeeper returns the warehouse UUID; Texera stores that UUID plus 
non-secret metadata.
   
   **Sequence diagram:** *(to be added)*
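
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the proposed implementation: `create_in_lakekeeper` and `save_record` are hypothetical injected callables standing in for the Lakekeeper client wrapper and the Texera DB write, and the record fields are made up. The point it shows is that the credentials pass through to Lakekeeper while the persisted record has no field that could hold them.

```python
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass(frozen=True)
class S3Settings:
    bucket: str
    endpoint: str
    region: str
    access_key_id: str      # secret material: forwarded, never stored
    secret_access_key: str  # secret material: forwarded, never stored


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera would persist: the reference plus non-secret metadata."""
    owner_uid: int
    lakekeeper_uuid: str
    bucket: str
    endpoint: str
    region: str


def register_warehouse(
    owner_uid: int,
    s3: S3Settings,
    create_in_lakekeeper: Callable[[S3Settings], str],  # hypothetical client call
    save_record: Callable[[WarehouseRecord], None],     # hypothetical DB write
) -> WarehouseRecord:
    # Step 2: hand the full settings, including credentials, to Lakekeeper.
    lakekeeper_uuid = create_in_lakekeeper(s3)
    # Step 3: persist only the returned UUID and non-secret fields.
    record = WarehouseRecord(owner_uid, lakekeeper_uuid, s3.bucket, s3.endpoint, s3.region)
    save_record(record)
    return record


saved: List[WarehouseRecord] = []
settings = S3Settings(
    "alice-bucket", "https://s3.example.com", "us-west-2",
    "EXAMPLE-KEY-ID", "not-a-real-secret",
)
record = register_warehouse(7, settings, lambda s3: "example-uuid", saved.append)
```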
   
   #### Flow B — Binding a warehouse to a CU
   
   1. When creating a CU, the user picks which warehouse to use.
   2. Texera instantiates a `RestCatalog` for that CU using the chosen 
warehouse's Lakekeeper UUID — no global singleton on the hot path.
   
   Execution-time read/write paths (catalog ops via Lakekeeper, direct Parquet 
IO to the user's S3 bucket using Lakekeeper-vended credentials) follow the 
model in #4126, now scoped to the user's warehouse instead of the global one.
   
   **Sequence diagram (CU creation + RestCatalog instantiation):** *(to be 
added)*
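
A minimal Python sketch of the binding step, assuming the catalog is configured through the standard Iceberg catalog properties `type`, `uri`, and `warehouse`. Two things here are assumptions rather than part of the proposal: that the `warehouse` property can carry the stored Lakekeeper reference as-is, and that only the warehouse owner may bind it (the shared-workflow question is still open). The function names and the example URI are hypothetical.

```python
from typing import Dict


def rest_catalog_properties(catalog_uri: str, warehouse_ref: str) -> Dict[str, str]:
    """Build the properties a REST catalog client would be initialised with."""
    return {"type": "rest", "uri": catalog_uri, "warehouse": warehouse_ref}


def bind_warehouse_to_cu(
    requester_uid: int,
    warehouse_owner_uid: int,
    catalog_uri: str,
    warehouse_ref: str,
) -> Dict[str, str]:
    # Assumed policy: a user can only bind a warehouse they own.
    if requester_uid != warehouse_owner_uid:
        raise PermissionError("warehouse belongs to another user")
    return rest_catalog_properties(catalog_uri, warehouse_ref)


props = bind_warehouse_to_cu(
    requester_uid=7,
    warehouse_owner_uid=7,
    catalog_uri="http://lakekeeper.example:8181/catalog",  # hypothetical endpoint
    warehouse_ref="example-uuid",
)
```

Because the properties contain no S3 credentials, the CU would still rely on Lakekeeper-vended credentials at execution time, as described above.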
   
   ### Open questions
   
   - Should a user own multiple warehouses, or exactly one? (Schema allows 
many; the picker UX may want a single primary.)
   - Shared workflows: when User A runs a workflow owned by User B, whose 
warehouse stores the results?
   - Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave 
S3 data orphaned in the user's bucket (Texera has no write access to user 
buckets), or soft-archive the catalog so existing executions stay readable 
until the user explicitly purges?
   
   ### Sub-tasks
   
   - [ ] Schema: add a `user_warehouse` table
   - [ ] Backend: REST endpoints for warehouse CRUD; Lakekeeper client wrapper
   - [ ] Backend: `Catalog` interface + `RestCatalog` impl; remove singleton 
assumption in `IcebergCatalogInstance`
   - [ ] Backend: thread per-CU `Catalog` through to executions (no global 
state)
   - [ ] CU plumbing: accept warehouse choice at CU creation; propagate to the 
worker
   - [ ] Frontend: new "Warehouse" dashboard tab
   - [ ] Frontend: warehouse picker in the CU creation flow
   - [ ] Docs: README + dashboard help text on BYO-S3

