bobbai00 opened a new issue, #5011:
URL: https://github.com/apache/texera/issues/5011
### Task Summary
## Motivation
CU Master and CU Worker run user-supplied UDF code. Today the CU pod
ships with the database credentials (`STORAGE_JDBC_URL`,
`STORAGE_JDBC_USERNAME`, `STORAGE_JDBC_PASSWORD`) in its environment so
the engine can read and write the metadata tables directly. Anything
that escapes the UDF sandbox can read those env vars and run arbitrary
SQL against the shared Postgres instance — read other users' workflows,
modify execution rows, drop data.
Removing the credentials from the executor closes that exposure. Web-app
should remain the only writer to the metadata DB; the executor should
hold no credentials.
## Current Usage
On `main`, `ComputingUnitMaster.run` opens a JDBC pool at startup:
```scala
SqlServer.initConnection(
  StorageConfig.jdbcUrl,
  StorageConfig.jdbcUsername,
  StorageConfig.jdbcPassword
)
```
Once the pool is open, several engine and service code paths reach
Postgres directly via `SqlServer`:

| Area | File | What it does |
|---|---|---|
| Execution lifecycle | `web/service/WorkflowService.scala` | `ExecutionsMetadataPersistService.insertNewExecution` (INSERT new row), `tryUpdateExistingExecution` (UPDATE status, log_location, runtime_stats_uri, result, etc.). |
| State transitions | `web/storage/ExecutionStateStore.scala` (`updateWorkflowState`) | Persists every workflow state change (READY/RUNNING/COMPLETED/FAILED/…). Called from many sites in `WorkflowService` / `WorkflowExecutionService`. |
| Operator/port URI registry | `web/resource/.../WorkflowExecutionsResource.scala` | `insertOperatorPortResultUri`, `insertOperatorConsoleUri`, `getResultUriByLogicalPortId`, etc. Called by engine code (e.g. `RegionExecutionCoordinator`) and by `SyncExecutionResource`. |
| Result/log cleanup | `web/ComputingUnitMaster.scala` (`cleanExecutions`, `recurringCheckExpiredResults`) | On startup and on a recurring schedule, queries `workflow_executions` for expired rows and updates their status. |
| Cost-based scheduling | `engine/architecture/scheduling/CostEstimator.scala` | `getOperatorExecutionTimeInSeconds` reads the latest successful `workflow_executions.runtime_stats_uri` for a `wid`. |
| Dataset path resolution | `common/workflow-core/.../FileResolver.scala` | `datasetResolveFunc` joins `USER × DATASET × DATASET_VERSION` to translate `/owner/dataset/version/file` into a `dataset:///<repo>/<hash>/<file>` URI. Hit during workflow compile. |

`ComputingUnitWorker.scala` itself is trivial (only calls
`AmberRuntime.startActorWorker`), but a Worker process shares the engine
code with the Master, so any of the engine-side call sites above
(notably `RegionExecutionCoordinator` and `CostEstimator`) execute
inside the Worker process when the corresponding actor is hosted there.
That is why Worker pods are also deployed with `STORAGE_JDBC_*` today.
## Proposed Design
Move every direct DB access reachable from CU Master / CU Worker behind
an HTTP service that owns the credentials. The executor holds no JDBC
config and authenticates each call by forwarding the originating user's
JWT.
```
                     ┌─ web-app ───────────────┐
CU Master/Worker ──▶ │ (execution metadata)    │ ──▶ Postgres
(JWT only)           │ file-service (datasets) │
                     └─────────────────────────┘
```
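As a rough illustration of this design, the sketch below shows how one call site (the state update currently done by `updateWorkflowState`) might become an HTTP request that carries only the user's JWT. The object name, base URL, endpoint path, and JSON body are all hypothetical; the issue does not define the HTTP contract, so treat this as one possible shape rather than the intended API.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.nio.charset.StandardCharsets

// Hypothetical client-side helper: none of these names or routes exist in Texera today.
object HttpExecutionStateClient {

  // Builds (but does not send) the request that would replace a direct
  // SqlServer write. The executor needs no JDBC config, only the JWT
  // of the user who launched the workflow.
  def buildUpdateStateRequest(
      baseUrl: String,     // assumed address of the credential-owning service
      executionId: Long,
      state: String,       // e.g. "RUNNING"; assumed to need no JSON escaping
      userJwt: String
  ): HttpRequest = {
    val body = s"""{"state":"$state"}"""
    HttpRequest
      .newBuilder(URI.create(s"$baseUrl/api/executions/$executionId/state")) // assumed route
      .header("Authorization", s"Bearer $userJwt") // forward the originating user's JWT
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
      .build()
  }
}
```

The request could then be sent with `java.net.http.HttpClient` (`HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`). Keeping request construction separate from sending makes the JWT forwarding easy to unit-test without a running service.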
## Roadmap
1. Inventory every `SqlServer` call site reachable from CU Master /
Worker (the table above is the starting point; double-check by
grepping the `WorkflowExecutionService` runtime classpath).
2. For each call site, define the HTTP contract that replaces it on
the appropriate owning service (web-app for execution metadata,
file-service for dataset resolution).
3. Establish the JWT-forwarding plumbing from each entry point on CU
Master through to the call site.
4. Migrate call sites one by one; each migration is independently
reviewable and can sit behind a feature toggle if needed.
5. Remove `SqlServer.initConnection` from `ComputingUnitMaster.run`.
6. Drop `STORAGE_JDBC_*` from CU pod templates
(`bin/k8s/templates/workflow-computing-unit-*.yaml`,
`bin/single-node/docker-compose.yml`) and from any forwarding logic
in `ComputingUnitManagingResource` that pushes those vars into
spawned pods.
7. Add a smoke test that boots CU Master with every `STORAGE_JDBC_*`
variable unset and runs a workflow end-to-end, so that CI enforces the contract.
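To complement the smoke test in the last step, the executor could also refuse to start when credentials are present. The following is a minimal sketch of such a guard; `JdbcEnvGuard` and its methods are hypothetical names, and only the `STORAGE_JDBC_` prefix comes from this issue.

```scala
// Hypothetical startup guard: fails fast if any STORAGE_JDBC_* variable
// is visible to the executor process.
object JdbcEnvGuard {
  private val Prefix = "STORAGE_JDBC_"

  // Returns the names (never the values) of credential variables found in
  // `env`, sorted so that error messages are deterministic.
  def leakedJdbcVars(env: Map[String, String]): Seq[String] =
    env.keys.filter(_.startsWith(Prefix)).toSeq.sorted

  // Throws IllegalArgumentException when a credential variable is present.
  def requireNoJdbcCredentials(env: Map[String, String]): Unit = {
    val leaked = leakedJdbcVars(env)
    require(
      leaked.isEmpty,
      s"executor must not receive DB credentials, found: ${leaked.mkString(", ")}"
    )
  }
}
```

A call such as `JdbcEnvGuard.requireNoJdbcCredentials(sys.env)` early in startup would surface a pod template that still injects the variables, independently of the CI test.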
### Task Type
- [x] Refactor / Cleanup
- [ ] DevOps / Deployment / CI
- [ ] Testing / QA
- [ ] Documentation
- [ ] Performance
- [ ] Other