GitHub user bobbai00 created a discussion: Refactor: Decoupling Database 
Connection From ComputingUnitMaster & ComputingUnitWorker


## Overview

Currently the amber engine directly depends on databases for execution status 
update, workflow compilation and cost-based optimization. Therefore, DB 
credentials have to be stored in Computing Unit as environment variables.

The fix is to move privileged database work out of the executor, not to harden 
the sandbox.

```
Before:  Frontend ─WS─▶ CU Master ─JDBC─▶ Postgres        (UDFs share the creds)
After:   Frontend ─WS─▶ CU Master ─HTTP(JWT)─▶ Backend ─JDBC─▶ Postgres
                        (no creds; runs a pre-resolved spec)
```

The Iceberg **REST catalog** already removes Iceberg metadata operations from 
direct Postgres. The remaining dependencies are what I want to decouple and 
discuss in this post

## Current System Architecture and Flow

Here are two diagrams describing the high-level traffic between services and 
storages when a workflow is being executed:

Phase 1: Workflow (logical plan) is submitted to Computing Unit
<img width="3832" height="1652" alt="image" 
src="https://github.com/user-attachments/assets/66a63d3d-465b-4fdc-b959-d21361c9a97c";
 />

Phase 2: Physical plan is being executed
<img width="3832" height="1652" alt="image" 
src="https://github.com/user-attachments/assets/7fde0bda-8dbe-415a-9620-76b2fad27251";
 />

### Execution Flow

Every red arrow below is a direct database hit from computing unit

```mermaid
sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant PG as Postgres
  participant CAT as Catalog Service
  participant S3I as S3 (Iceberg Tables)

  U->>M: Run workflow (**logical plan** + JWT)
  rect rgb(255, 235, 235)
  M->>PG: Look up latest workflow version
  M->>PG: Create execution record (get eid)
  M->>PG: Clear previous run's output locations
  opt fault tolerance
    M->>PG: Save replay-log location
  end
  Note over M: compile **logical plan** → **physical plan** (in-process)
  M->>PG: Resolve dataset paths to storage locations
  Note over M: cost-based optimization (in-process)
  M->>PG: Read last run's stats (cost-based optimization)
  M->>PG: Record where results/console/stats go
  end
  M->>CAT: Commit result/console/stats tables (REST — no DB login)
  M->>S3I: Write result/console/stats data
  M-->>U: Live status, stats & errors (WebSocket)
```

## Proposed Design

### Design A — Proxy to the Backend

Keep the CU's logic in place; replace each `SqlServer` call with an HTTP call 
that forwards the user's JWT. The Dashboard Service authorizes and runs the SQL.

```mermaid
sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres

  U->>M: Run workflow (**logical plan** + JWT)
  Note over M: no DB credentials
  M->>DS: Create execution record (JWT)
  DS->>PG: Create execution record
  DS-->>M: eid
  rect rgb(255, 245, 200)
  Note over M: Compile & Execute
  M->>FS: Resolve dataset paths (JWT)
  FS->>PG: Look up datasets & versions
  FS-->>M: resolved dataset URIs
  Note over M: compile **logical plan** → **physical plan**, then execute (in 
CU Master)
  end
  M->>DS: Report result/console/stats locations (JWT)
  DS->>PG: Persist result/console/stats locations
  M-->>U: Live status, stats & console (WebSocket)
```

*Yellow highlights compilation **and** execution — in Design A both happen 
inside the CU Master.*

The main ideas of this design:
- The **CU Master keeps its current role** — it still **compiles the logical 
plan → physical plan and executes**, both in-process — only its *database 
access* is removed.
- Every former direct `SqlServer` call becomes an **HTTP call that forwards the 
user's JWT**: execution-record create/update goes to the **Dashboard Service**, 
dataset resolution goes to **file-service**, and each service authorizes the 
caller before touching Postgres.
- **ComputingUnitMaster** still receives the **logical plan**, but resolves 
datasets and reads/writes execution metadata over HTTP — it holds no DB 
credentials and never touches Postgres directly.

### Design B — Pure Execution Backend *(My Preference)*

The CU becomes a stateless executor. It **receives** one self-contained 
`ExecutionSpec` (the already-compiled physical plan + `eid` + last execution 
stats), runs it, writes only to object storage through the REST catalog, and 
reports a completion manifest. The backend (Dashboard Service + 
workflow-compiling-service + file-service) does all compilation, dataset 
resolution, `eid` allocation, and persistence.

```mermaid
sequenceDiagram
  autonumber
  actor FE as Frontend
  participant WCS as workflow-compiling-service
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres
  participant CU as CU Master
  participant CAT as Catalog Service

  rect rgb(255, 245, 200)
  Note over WCS: Compile
  FE->>WCS: Compile workflow (**logical plan** + JWT)
  WCS->>FS: Resolve dataset paths (access-checked)
  FS->>PG: Look up datasets & versions
  WCS-->>FE: **physical plan** (dataset URIs resolved)
  end
  FE->>DS: Prepare execution (wid, cuid, JWT)
  DS->>PG: Create execution record & read last execution stats
  DS-->>FE: eid + last execution stats
  rect rgb(255, 245, 200)
  Note over CU: Execute
  FE->>CU: Run execution (**physical plan** + eid + last execution stats + JWT)
  Note over CU: no DB credentials
  CU->>CAT: Commit & write results (REST — no DB login)
  end
  CU-->>FE: Live status, stats & console (WebSocket)
  CU->>DS: Report final status & manifest (JWT)
  DS->>PG: Persist final status & locations
```

*Yellow highlights compilation (in workflow-compiling-service) and execution 
(in CU Master) — split across services, unlike Design A.*

The main ideas of this design:
- The **frontend** asks **workflow-compiling-service** to compile the workflow 
— which **resolves dataset paths via file-service** and bakes the resolved URIs 
into the physical plan.
- It then asks the **Dashboard Service** to prepare the execution, passing only 
the **workflow id/version + cuid** (not the plan). The Dashboard Service 
creates the execution record, reads the **last execution stats** (used by the 
engine for cost-based optimization), and returns the **`eid` + last execution 
stats**. It never sees the physical plan and never resolves datasets.
- The **frontend assembles the run request** — physical plan (from the 
compiling-service) + `eid` + last execution stats (from the Dashboard Service) 
— and sends it to the CU over the WebSocket.
- **ComputingUnitMaster** receives the **physical plan**, executes it, streams 
status to the frontend, and reports the final status/manifest back. It no 
longer compiles the **logical plan** and never touches Postgres.


GitHub link: https://github.com/apache/texera/discussions/5295

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to