GitHub user bobbai00 created a discussion: Refactor: Decoupling Database
Connection From ComputingUnitMaster & ComputingUnitWorker
## Overview
Currently the amber engine directly depends on databases for execution status
update, workflow compilation and cost-based optimization. Therefore, DB
credentials have to be stored in Computing Unit as environment variables.
The fix is to move privileged database work out of the executor, not to harden
the sandbox.
```
Before: Frontend ─WS─▶ CU Master ─JDBC─▶ Postgres (UDFs share the creds)
After: Frontend ─WS─▶ CU Master ─HTTP(JWT)─▶ Backend ─JDBC─▶ Postgres
(no creds; runs a pre-resolved spec)
```
The Iceberg **REST catalog** already removes Iceberg metadata operations from
direct Postgres. The remaining dependencies are what I want to decouple and
discuss in this post
## Current System Architecture and Flow
Here are two diagrams describing the high-level traffic between services and
storages when a workflow is being executed:
Phase 1: Workflow (logical plan) is submitted to Computing Unit
<img width="3832" height="1652" alt="image"
src="https://github.com/user-attachments/assets/66a63d3d-465b-4fdc-b959-d21361c9a97c"
/>
Phase 2: Physical plan is being executed
<img width="3832" height="1652" alt="image"
src="https://github.com/user-attachments/assets/7fde0bda-8dbe-415a-9620-76b2fad27251"
/>
### Execution Flow
Every red arrow below is a direct database hit from computing unit
```mermaid
sequenceDiagram
autonumber
actor U as Frontend
participant M as CU Master
participant PG as Postgres
participant CAT as Catalog Service
participant S3I as S3 (Iceberg Tables)
U->>M: Run workflow (**logical plan** + JWT)
rect rgb(255, 235, 235)
M->>PG: Look up latest workflow version
M->>PG: Create execution record (get eid)
M->>PG: Clear previous run's output locations
opt fault tolerance
M->>PG: Save replay-log location
end
Note over M: compile **logical plan** → **physical plan** (in-process)
M->>PG: Resolve dataset paths to storage locations
Note over M: cost-based optimization (in-process)
M->>PG: Read last run's stats (cost-based optimization)
M->>PG: Record where results/console/stats go
end
M->>CAT: Commit result/console/stats tables (REST — no DB login)
M->>S3I: Write result/console/stats data
M-->>U: Live status, stats & errors (WebSocket)
```
## Proposed Design
### Design A — Proxy to the Backend
Keep the CU's logic in place; replace each `SqlServer` call with an HTTP call
that forwards the user's JWT. The Dashboard Service authorizes and runs the SQL.
```mermaid
sequenceDiagram
autonumber
actor U as Frontend
participant M as CU Master
participant DS as Dashboard Service
participant FS as file-service
participant PG as Postgres
U->>M: Run workflow (**logical plan** + JWT)
Note over M: no DB credentials
M->>DS: Create execution record (JWT)
DS->>PG: Create execution record
DS-->>M: eid
rect rgb(255, 245, 200)
Note over M: Compile & Execute
M->>FS: Resolve dataset paths (JWT)
FS->>PG: Look up datasets & versions
FS-->>M: resolved dataset URIs
Note over M: compile **logical plan** → **physical plan**, then execute (in
CU Master)
end
M->>DS: Report result/console/stats locations (JWT)
DS->>PG: Persist result/console/stats locations
M-->>U: Live status, stats & console (WebSocket)
```
*Yellow highlights compilation **and** execution — in Design A both happen
inside the CU Master.*
The main ideas of this design:
- The **CU Master keeps its current role** — it still **compiles the logical
plan → physical plan and executes**, both in-process — only its *database
access* is removed.
- Every former direct `SqlServer` call becomes an **HTTP call that forwards the
user's JWT**: execution-record create/update goes to the **Dashboard Service**,
dataset resolution goes to **file-service**, and each service authorizes the
caller before touching Postgres.
- **ComputingUnitMaster** still receives the **logical plan**, but resolves
datasets and reads/writes execution metadata over HTTP — it holds no DB
credentials and never touches Postgres directly.
### Design B — Pure Execution Backend *(My Preference)*
The CU becomes a stateless executor. It **receives** one self-contained
`ExecutionSpec` (the already-compiled physical plan + `eid` + last execution
stats), runs it, writes only to object storage through the REST catalog, and
reports a completion manifest. The backend (Dashboard Service +
workflow-compiling-service + file-service) does all compilation, dataset
resolution, `eid` allocation, and persistence.
```mermaid
sequenceDiagram
autonumber
actor FE as Frontend
participant WCS as workflow-compiling-service
participant DS as Dashboard Service
participant FS as file-service
participant PG as Postgres
participant CU as CU Master
participant CAT as Catalog Service
rect rgb(255, 245, 200)
Note over WCS: Compile
FE->>WCS: Compile workflow (**logical plan** + JWT)
WCS->>FS: Resolve dataset paths (access-checked)
FS->>PG: Look up datasets & versions
WCS-->>FE: **physical plan** (dataset URIs resolved)
end
FE->>DS: Prepare execution (wid, cuid, JWT)
DS->>PG: Create execution record & read last execution stats
DS-->>FE: eid + last execution stats
rect rgb(255, 245, 200)
Note over CU: Execute
FE->>CU: Run execution (**physical plan** + eid + last execution stats + JWT)
Note over CU: no DB credentials
CU->>CAT: Commit & write results (REST — no DB login)
end
CU-->>FE: Live status, stats & console (WebSocket)
CU->>DS: Report final status & manifest (JWT)
DS->>PG: Persist final status & locations
```
*Yellow highlights compilation (in workflow-compiling-service) and execution
(in CU Master) — split across services, unlike Design A.*
The main ideas of this design:
- The **frontend** asks **workflow-compiling-service** to compile the workflow
— which **resolves dataset paths via file-service** and bakes the resolved URIs
into the physical plan.
- It then asks the **Dashboard Service** to prepare the execution, passing only
the **workflow id/version + cuid** (not the plan). The Dashboard Service
creates the execution record, reads the **last execution stats** (used by the
engine for cost-based optimization), and returns the **`eid` + last execution
stats**. It never sees the physical plan and never resolves datasets.
- The **frontend assembles the run request** — physical plan (from the
compiling-service) + `eid` + last execution stats (from the Dashboard Service)
— and sends it to the CU over the WebSocket.
- **ComputingUnitMaster** receives the **physical plan**, executes it, streams
status to the frontend, and reports the final status/manifest back. It no
longer compiles the **logical plan** and never touches Postgres.
GitHub link: https://github.com/apache/texera/discussions/5295
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]