GitHub user bobbai00 edited a discussion: Refactor: Decoupling Direct Database
Connection From ComputingUnitMaster & ComputingUnitWorker
## Motivation of this refactoring
Currently the amber engine directly depends on databases for execution status
update, workflow compilation and cost-based optimization. Therefore, DB
credentials have to be stored in Computing Unit as environment variables.
Removing the DB credentials from Computing Unit makes each CU a better, safer
sandbox environment for executing workflows, especially for workflows with UDF
operators.
## Refactor Overview
The fix is to move privileged database work out of the executor, not to harden
the sandbox.
```
Before: Frontend ─WS─▶ CU Master ─JDBC─▶ Postgres (UDFs share the creds)
After: Frontend ─WS─▶ CU Master ─HTTP(JWT)─▶ Backend ─JDBC─▶ Postgres
(no creds; runs a pre-resolved spec)
```
The Iceberg **REST catalog** already removes Iceberg metadata operations from
direct Postgres. The remaining dependencies are what I want to decouple and
discuss in this post
## Current System Architecture and Flow
Here are two diagrams describing the high-level traffic between services and
storages when a workflow is being executed:
Phase 1: Workflow (logical plan) is submitted to Computing Unit
<img width="3832" height="1652" alt="image"
src="https://github.com/user-attachments/assets/66a63d3d-465b-4fdc-b959-d21361c9a97c"
/>
Phase 2: Physical plan is being executed
<img width="3832" height="1652" alt="image"
src="https://github.com/user-attachments/assets/7fde0bda-8dbe-415a-9620-76b2fad27251"
/>
### Execution Flow
Every red arrow below is a direct database hit from computing unit
```mermaid
sequenceDiagram
autonumber
actor U as Frontend
participant M as CU Master
participant PG as Postgres
participant CAT as Catalog Service
participant S3I as S3 (Iceberg Tables)
U->>M: Run workflow (**logical plan** + JWT)
rect rgb(255, 235, 235)
M->>PG: Look up latest workflow version
M->>PG: Create execution record (get eid)
M->>PG: Clear previous run's output locations
opt fault tolerance
M->>PG: Save replay-log location
end
Note over M: compile **logical plan** → **physical plan** (in-process)
M->>PG: Resolve dataset paths to storage locations
Note over M: cost-based optimization (in-process)
M->>PG: Read last run's stats (cost-based optimization)
M->>PG: Record where results/console/stats go
end
M->>CAT: Commit result/console/stats tables (REST — no DB login)
M->>S3I: Write result/console/stats data
M-->>U: Live status, stats & errors (WebSocket)
```
## Proposed Design
### Design A — Proxy to the Backend
Keep the CU's logic in place; replace each `SqlServer` call with an HTTP call
that forwards the user's JWT. The Dashboard Service authorizes and runs the SQL.
```mermaid
sequenceDiagram
autonumber
actor U as Frontend
participant M as CU Master
participant DS as Dashboard Service
participant FS as file-service
participant PG as Postgres
U->>M: Run workflow (**logical plan** + JWT)
Note over M: no DB credentials
M->>DS: Create execution record (JWT)
DS->>PG: Create execution record
DS-->>M: eid
rect rgb(255, 245, 200)
Note over M: Compile & Execute
M->>FS: Resolve dataset paths (JWT)
FS->>PG: Look up datasets & versions
FS-->>M: resolved dataset URIs
Note over M: compile **logical plan** → **physical plan**, then execute (in
CU Master)
end
M->>DS: Report result/console/stats locations (JWT)
DS->>PG: Persist result/console/stats locations
M-->>U: Live status, stats & console (WebSocket)
```
*Yellow highlights compilation **and** execution — in Design A both happen
inside the CU Master.*
The main ideas of this design:
- The **CU Master keeps its current role** — it still **compiles the logical
plan → physical plan and executes**, both in-process — only its *database
access* is removed.
- Every former direct `SqlServer` call becomes an **HTTP call that forwards the
user's JWT**: execution-record create/update goes to the **Dashboard Service**,
dataset resolution goes to **file-service**, and each service authorizes the
caller before touching Postgres.
- **ComputingUnitMaster** still receives the **logical plan**, but resolves
datasets and reads/writes execution metadata over HTTP — it holds no DB
credentials and never touches Postgres directly.
### Design B — Pure Execution Backend *(My Preference)*
The CU becomes a stateless executor. It **receives** one self-contained
`ExecutionSpec` (the already-compiled physical plan + `eid` + last execution
stats), runs it, writes only to object storage through the REST catalog, and
reports a completion manifest. The backend (Dashboard Service +
workflow-compiling-service + file-service) does all compilation, dataset
resolution, `eid` allocation, and persistence.
```mermaid
sequenceDiagram
autonumber
actor FE as Frontend
participant WCS as workflow-compiling-service
participant DS as Dashboard Service
participant FS as file-service
participant PG as Postgres
participant CU as CU Master
participant CAT as Catalog Service
rect rgb(255, 245, 200)
Note over WCS: Compile
FE->>WCS: Compile workflow (**logical plan** + JWT)
WCS->>FS: Resolve dataset paths (access-checked)
FS->>PG: Look up datasets & versions
WCS-->>FE: **physical plan** (dataset URIs resolved)
end
FE->>DS: Prepare execution (wid, cuid, JWT)
DS->>PG: Create execution record & read last execution stats
DS-->>FE: eid + last execution stats
rect rgb(255, 245, 200)
Note over CU: Execute
FE->>CU: Run execution (**physical plan** + eid + last execution stats + JWT)
Note over CU: no DB credentials
CU->>CAT: Commit & write results (REST — no DB login)
end
CU-->>FE: Live status, stats & console (WebSocket)
CU->>DS: Report final status & manifest (JWT)
DS->>PG: Persist final status & locations
```
*Yellow highlights compilation (in workflow-compiling-service) and execution
(in CU Master) — split across services, unlike Design A.*
The main ideas of this design:
- The **frontend** asks **workflow-compiling-service** to compile the workflow
— which **resolves dataset paths via file-service** and bakes the resolved URIs
into the physical plan.
- It then asks the **Dashboard Service** to prepare the execution, passing only
the **workflow id/version + cuid** (not the plan). The Dashboard Service
creates the execution record, reads the **last execution stats** (used by the
engine for cost-based optimization), and returns the **`eid` + last execution
stats**. It never sees the physical plan and never resolves datasets.
- The **frontend assembles the run request** — physical plan (from the
compiling-service) + `eid` + last execution stats (from the Dashboard Service)
— and sends it to the CU over the WebSocket.
- **ComputingUnitMaster** receives the **physical plan**, executes it, streams
status to the frontend, and reports the final status/manifest back. It no
longer compiles the **logical plan** and never touches Postgres.
GitHub link: https://github.com/apache/texera/discussions/5295
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]