[I] Per-user Iceberg warehouse with bring-your-own S3 storage [texera]

via GitHub Wed, 20 May 2026 00:17:35 -0700


mengw15 opened a new issue, #5135:
URL: https://github.com/apache/texera/issues/5135

### Feature Summary

Today Texera writes all execution outputs (`results`, `runtime_stats`,
`console_logs`) into a single **global Iceberg warehouse** — see
[`IcebergCatalogInstance.scala`](https://github.com/apache/texera/blob/a820f6727/common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/IcebergCatalogInstance.scala#L32),
a process-wide singleton. All users share that one warehouse, and the platform
absorbs the storage cost.

This issue proposes a **per-user warehouse** model: each user registers one
or more warehouses, each backed by **their own S3 bucket** (Bring-Your-Own-S3).
Storage cost follows the data owner; users get tenant-isolated namespaces and
tables.

### Background / Motivation

- **Billing.** S3 cost should be attributed to the user who owns the data,
not the platform.
- **Isolation.** Per-tenant namespaces/tables, no shared blast radius.
- **Builds on #4126** (Migrate to Result Service and MinIO for Execution
Results) — that issue introduced the REST Catalog Service (Lakekeeper) layer.
This issue is the next step: make Lakekeeper multi-tenant.

### Scope

Per-user warehouses are scoped to the **Kubernetes deployment**. Local /
single-node Docker Compose deployments continue to work as today: `PsqlCatalog`
remains supported and unchanged, and `RestCatalog` mode keeps its current
single global Lakekeeper warehouse (no per-user split).

### Proposed Solution or Design

#### Data model

```
User ─1:N→ Warehouse (new)
User ─1:N→ ComputingUnit (existing)
ComputingUnit ─1:N→ Execution (existing)
Warehouse ─1:N→ Execution (new association)
```

**ER diagram:** *(to be added)*
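
Until the ER diagram is added, the relationships above can be sketched as a relational schema. This is an illustration only: the table and column names (`app_user`, `user_warehouse.lakekeeper_warehouse_id`, `bucket`, and so on) are assumptions, not the actual Texera DDL, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE app_user (uid INTEGER PRIMARY KEY);
CREATE TABLE user_warehouse (
    wid INTEGER PRIMARY KEY,
    uid INTEGER NOT NULL REFERENCES app_user(uid),   -- User 1:N Warehouse
    lakekeeper_warehouse_id TEXT NOT NULL UNIQUE,    -- reference returned by Lakekeeper
    bucket TEXT NOT NULL,                            -- non-secret metadata only
    region TEXT
);
CREATE TABLE computing_unit (
    cuid INTEGER PRIMARY KEY,
    uid INTEGER NOT NULL REFERENCES app_user(uid)    -- User 1:N ComputingUnit
);
CREATE TABLE execution (
    eid INTEGER PRIMARY KEY,
    cuid INTEGER NOT NULL REFERENCES computing_unit(cuid),  -- ComputingUnit 1:N Execution
    wid INTEGER NOT NULL REFERENCES user_warehouse(wid)     -- Warehouse 1:N Execution
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one user, two warehouses, one CU."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO app_user (uid) VALUES (1)")
    conn.executemany(
        "INSERT INTO user_warehouse (wid, uid, lakekeeper_warehouse_id, bucket, region) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "wh-uuid-a", "alice-results", "us-west-2"),
            (2, 1, "wh-uuid-b", "alice-archive", "us-east-1"),
        ],
    )
    conn.execute("INSERT INTO computing_unit (cuid, uid) VALUES (10, 1)")
    conn.executemany(
        "INSERT INTO execution (eid, cuid, wid) VALUES (?, ?, ?)",
        [(100, 10, 1), (101, 10, 1)],
    )
    conn.commit()
    return conn


def warehouses_for_user(conn: sqlite3.Connection, uid: int) -> list:
    """Return the Lakekeeper warehouse references owned by one user."""
    rows = conn.execute(
        "SELECT lakekeeper_warehouse_id FROM user_warehouse WHERE uid = ? ORDER BY wid",
        (uid,),
    ).fetchall()
    return [row[0] for row in rows]
```

Note that no credential column appears anywhere in the sketch, which matches the constraint that Texera stores only the warehouse reference and non-secret metadata.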

#### Catalog hierarchy

Promote `Catalog` from a process-wide singleton to an interface that is
instantiated per user and per computing unit (CU). Texera already has two
`Catalog` implementations:

```
Catalog (interface)
├── PsqlCatalog — backed by PostgreSQL
└── RestCatalog — backed by any Iceberg REST Catalog service
```

This design uses **`RestCatalog` with Lakekeeper** as the REST Catalog
service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its
own encrypted DB; **Texera never persists raw S3 creds**, only the Lakekeeper
warehouse UUID and non-secret metadata.

#### Flow A — Registering a warehouse

1. User fills the new Dashboard "Warehouse" tab with S3 bucket / endpoint /
region / credentials.
2. Backend posts the credentials directly to Lakekeeper to create the
warehouse. **Creds never touch the Texera DB.**
3. Lakekeeper returns the warehouse UUID; Texera stores the reference plus
non-secret metadata.

**Sequence diagram:** *(to be added)*
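
As a rough sketch of the three steps above, the registration handler could forward the full form to Lakekeeper and keep only the non-secret fields. Everything named here is hypothetical: the form field names, the `WarehouseRecord` type, and the injected `create_in_lakekeeper` callable, which stands in for the real HTTP call to Lakekeeper (its endpoint and payload shape are not shown or assumed).

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict

# Hypothetical form field names; the real dashboard form may differ.
SECRET_FIELDS = {"access_key_id", "secret_access_key"}
REQUIRED_FIELDS = {"bucket", "endpoint", "region"} | SECRET_FIELDS


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera would persist: a reference plus non-secret metadata."""
    owner_uid: int
    lakekeeper_warehouse_id: str
    bucket: str
    endpoint: str
    region: str


def register_warehouse(
    owner_uid: int,
    form: Dict[str, str],
    create_in_lakekeeper: Callable[[Dict[str, str]], str],
) -> WarehouseRecord:
    """Forward credentials to Lakekeeper and return the row to store locally."""
    missing = REQUIRED_FIELDS - form.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    # Step 2: the credentials go to Lakekeeper only.
    warehouse_id = create_in_lakekeeper(dict(form))

    # Step 3: keep the returned reference and the non-secret metadata.
    return WarehouseRecord(
        owner_uid=owner_uid,
        lakekeeper_warehouse_id=warehouse_id,
        bucket=form["bucket"],
        endpoint=form["endpoint"],
        region=form["region"],
    )


def db_row(record: WarehouseRecord) -> Dict[str, object]:
    """Columns that would be written to the Texera DB (no secrets by construction)."""
    return asdict(record)
```

Injecting the Lakekeeper call as a function keeps the handler testable with a fake, and the record type has no field that could hold a credential.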

#### Flow B — Binding a warehouse to a CU

1. When the user creates a CU, they pick which warehouse to use.
2. Texera instantiates a `RestCatalog` for that CU using the chosen
warehouse's Lakekeeper UUID — no global singleton on the hot path.

Execution-time read/write paths (catalog ops via Lakekeeper, direct Parquet
IO to the user's S3 bucket using Lakekeeper-vended credentials) follow the
model in #4126, now scoped to the user's warehouse instead of the global one.

**Sequence diagram (CU creation + RestCatalog instantiation):** *(to be
added)*
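
One way to picture the per-CU binding is a small registry that holds REST catalog properties keyed by CU instead of a single global instance. This is a sketch under assumptions: `CatalogRegistry` and its methods are hypothetical names, and while `uri` and `warehouse` are the standard Iceberg catalog property keys, whether Lakekeeper expects the warehouse UUID or a warehouse name in the `warehouse` property is not settled by this issue.

```python
from typing import Dict


class CatalogRegistry:
    """Holds one set of REST catalog properties per computing unit (CU)."""

    def __init__(self, lakekeeper_uri: str) -> None:
        self._uri = lakekeeper_uri
        self._by_cu: Dict[int, Dict[str, str]] = {}

    def bind(self, cuid: int, warehouse_ref: str) -> Dict[str, str]:
        """Record the warehouse chosen at CU creation time."""
        if cuid in self._by_cu:
            raise ValueError(f"CU {cuid} is already bound to a warehouse")
        # Assumption: the warehouse reference is passed via the `warehouse` property.
        self._by_cu[cuid] = {"uri": self._uri, "warehouse": warehouse_ref}
        return dict(self._by_cu[cuid])

    def properties_for(self, cuid: int) -> Dict[str, str]:
        """Return a copy of the properties a RestCatalog for this CU would use.

        Raises KeyError when the CU was never bound, so there is no silent
        fallback to a global warehouse.
        """
        return dict(self._by_cu[cuid])
```

A caller would bind once at CU creation and then build the CU's `RestCatalog` from `properties_for(cuid)`; an unbound CU fails loudly rather than writing to a shared default.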

### Open questions

- Should a user own multiple warehouses, or exactly one? (Schema allows
many; the picker UX may want a single primary.)
- Shared workflows: when User A runs a workflow owned by User B, whose
warehouse stores the results?
- Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave
S3 data orphaned in the user's bucket (Texera has no write access to user
buckets), or soft-archive the catalog so existing executions stay readable
until the user explicitly purges?

### Sub-tasks

- [ ] Schema: add a `user_warehouse` table
- [ ] Backend: REST endpoints for warehouse CRUD; Lakekeeper client wrapper
- [ ] Backend: `Catalog` interface + `RestCatalog` impl; remove singleton
assumption in `IcebergCatalogInstance`
- [ ] Backend: thread per-CU `Catalog` through to executions (no global
state)
- [ ] CU plumbing: accept warehouse choice at CU creation; propagate to the
worker
- [ ] Frontend: new "Warehouse" dashboard tab
- [ ] Frontend: warehouse picker in the CU creation flow
- [ ] Docs: README + dashboard help text on BYO-S3

