GitHub user bobbai00 edited a discussion: Refactor: Decoupling Direct Database 
Connection From ComputingUnitMaster & ComputingUnitWorker


## Motivation of this refactoring

Currently the amber engine directly queries the Postgres for execution status 
update, workflow compilation and cost-based optimization. Therefore, DB 
credentials have to be stored in Computing Unit as environment variables. 
Removing the DB credentials from Computing Unit makes each CU a better, safer 
sandbox environment for executing workflows, especially for workflows with UDF 
operators.

## Refactor Overview

The fix is to move privileged database work out of the executor, not to harden 
the sandbox.

```
Before:  Frontend ─WS─▶ CU Master ─JDBC─▶ Postgres        (UDFs share the creds)
After:   Frontend ─WS─▶ CU Master ─HTTP(JWT)─▶ Backend ─JDBC─▶ Postgres
                        (no creds; runs a pre-resolved spec)
```

The Iceberg **REST catalog** already removes Iceberg metadata operations from 
direct Postgres. The remaining dependencies are what I want to decouple and 
discuss in this post

## Current System Architecture and Flow

Here are two diagrams describing the high-level traffic between services and 
storages when a workflow is being executed:

Phase 1: Workflow (logical plan) is submitted to Computing Unit
<img width="3832" height="1652" alt="image" 
src="https://github.com/user-attachments/assets/66a63d3d-465b-4fdc-b959-d21361c9a97c";
 />

Phase 2: Physical plan is being executed
<img width="3832" height="1652" alt="image" 
src="https://github.com/user-attachments/assets/7fde0bda-8dbe-415a-9620-76b2fad27251";
 />

### Execution Flow

Every red arrow below is a direct database hit from computing unit

```mermaid
sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant PG as Postgres
  participant CAT as Catalog Service
  participant S3I as S3 (Iceberg Tables)

  U->>M: Run workflow (**logical plan** + JWT)
  rect rgb(255, 235, 235)
  M->>PG: Look up latest workflow version
  M->>PG: Create execution record (get eid)
  M->>PG: Clear previous run's output locations
  opt fault tolerance
    M->>PG: Save replay-log location
  end
  Note over M: compile **logical plan** → **physical plan** (in-process)
  M->>PG: Resolve dataset paths to storage locations
  Note over M: cost-based optimization (in-process)
  M->>PG: Read last run's stats (cost-based optimization)
  M->>PG: Record where results/console/stats go
  end
  M->>CAT: Commit result/console/stats tables (REST — no DB login)
  M->>S3I: Write result/console/stats data
  M-->>U: Live status, stats & errors (WebSocket)
```

## Proposed Design

### Design A — Proxy to the Backend

Keep the CU's logic in place; replace each `SqlServer` call with an HTTP call 
that forwards the user's JWT. The Dashboard Service authorizes and runs the SQL.

```mermaid
sequenceDiagram
  autonumber
  actor U as Frontend
  participant M as CU Master
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres

  U->>M: Run workflow (**logical plan** + JWT)
  Note over M: no DB credentials
  M->>DS: Create execution record (JWT)
  DS->>PG: Create execution record
  DS-->>M: eid
  rect rgb(255, 245, 200)
  Note over M: Compile & Execute
  M->>FS: Resolve dataset paths (JWT)
  FS->>PG: Look up datasets & versions
  FS-->>M: resolved dataset URIs
  Note over M: compile **logical plan** → **physical plan**, then execute (in 
CU Master)
  end
  M->>DS: Report result/console/stats locations (JWT)
  DS->>PG: Persist result/console/stats locations
  M-->>U: Live status, stats & console (WebSocket)
```

*Yellow highlights compilation **and** execution — in Design A both happen 
inside the CU Master.*

The main ideas of this design:
- The **CU Master keeps its current role** — it still **compiles the logical 
plan → physical plan and executes**, both in-process — only its *database 
access* is removed.
- Every former direct `SqlServer` call becomes an **HTTP call that forwards the 
user's JWT**: execution-record create/update goes to the **Dashboard Service**, 
dataset resolution goes to **file-service**, and each service authorizes the 
caller before touching Postgres.
- **ComputingUnitMaster** still receives the **logical plan**, but resolves 
datasets and reads/writes execution metadata over HTTP — it holds no DB 
credentials and never touches Postgres directly.

### Design B — Pure Execution Backend *(My Preference)*

The CU becomes a stateless executor. It **receives** one self-contained 
`ExecutionSpec` (the already-compiled physical plan + `eid` + last execution 
stats), runs it, writes only to object storage through the REST catalog, and 
reports a completion manifest. The backend (Dashboard Service + 
workflow-compiling-service + file-service) does all compilation, dataset 
resolution, `eid` allocation, and persistence.

```mermaid
sequenceDiagram
  autonumber
  actor FE as Frontend
  participant WCS as workflow-compiling-service
  participant DS as Dashboard Service
  participant FS as file-service
  participant PG as Postgres
  participant CU as CU Master
  participant CAT as Catalog Service

  rect rgb(255, 245, 200)
  Note over WCS: Compile
  FE->>WCS: Compile workflow (**logical plan** + JWT)
  WCS->>FS: Resolve dataset paths (access-checked)
  FS->>PG: Look up datasets & versions
  WCS-->>FE: **physical plan** (dataset URIs resolved)
  end
  FE->>DS: Prepare execution (wid, cuid, JWT)
  DS->>PG: Create execution record & read last execution stats
  DS-->>FE: eid + last execution stats
  rect rgb(255, 245, 200)
  Note over CU: Execute
  FE->>CU: Run execution (**physical plan** + eid + last execution stats + JWT)
  Note over CU: no DB credentials
  CU->>CAT: Commit & write results (REST — no DB login)
  end
  CU-->>FE: Live status, stats & console (WebSocket)
  CU->>DS: Report final status & manifest (JWT)
  DS->>PG: Persist final status & locations
```

*Yellow highlights compilation (in workflow-compiling-service) and execution 
(in CU Master) — split across services, unlike Design A.*

The main ideas of this design:
- The **frontend** asks **workflow-compiling-service** to compile the workflow 
— which **resolves dataset paths via file-service** and bakes the resolved URIs 
into the physical plan.
- It then asks the **Dashboard Service** to prepare the execution, passing only 
the **workflow id/version + cuid** (not the plan). The Dashboard Service 
creates the execution record, reads the **last execution stats** (used by the 
engine for cost-based optimization), and returns the **`eid` + last execution 
stats**. It never sees the physical plan and never resolves datasets.
- The **frontend assembles the run request** — physical plan (from the 
compiling-service) + `eid` + last execution stats (from the Dashboard Service) 
— and sends it to the CU over the WebSocket.
- **ComputingUnitMaster** receives the **physical plan**, executes it, streams 
status to the frontend, and reports the final status/manifest back. It no 
longer compiles the **logical plan** and never touches Postgres.


GitHub link: https://github.com/apache/texera/discussions/5295

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to