MelihErduran opened a new pull request, #5079:
URL: https://github.com/apache/texera/pull/5079
# PR: One-command local dev orchestrator for Texera (`bin/texera`)
## Summary
Adds `bin/texera` — a single Bash CLI that replaces the previous "open 7
IntelliJ run configs in the right order, then `yarn start` in a different
terminal" workflow with one command:
```
texera start
```
It launches Postgres + LakeFS/MinIO + every backend JVM service +
agent-service + frontend, in the right order, with prefixed log streams in one
terminal, a live bottom-pinned health bar, and clean teardown on Ctrl+C. Also
ships subcommands for `setup`, `build`, `stop`, `status`, and `logs`.
Complementary helper `bin/check-services.sh` provides a one-shot probe outside
an active session.
This is a dev-tool addition — nothing about the services themselves changes.
The existing `.run/*.xml` IntelliJ configs and `bin/single-node` Docker deploy
paths are untouched.
## Motivation
Before this PR, getting Texera running locally required:
- Knowing the launch order (master before worker, infra before JVMs).
- Knowing the seven `bin/*-service.sh` scripts plus the un-scripted
agent-service and frontend.
- Eyeballing seven different terminals to figure out whether the stack was
actually up.
- Manually `pkill`ing JVMs when something crashed, because there was no
cleanup story.
- Hitting `file-service crashed on boot` ~50% of the time when LakeFS wasn't
quite ready.
New contributors hit all of this on day one. Existing contributors lived
with it but lost ~5 minutes per restart.
## What's in this PR
```
bin/texera              new   one-command orchestrator (~875 lines)
bin/check-services.sh   new   standalone health probe (~118 lines)
bin/build-services.sh   mod   add access-control-service to dist+unzip;
                              rename amber zip target
                              (texera-*.zip → amber-*.zip)
```
`bin/texera` is the main feature; the other two are small.
## Subcommands
```
texera setup            One-time: toolchain check + sbt build +
                        frontend/python deps + SQL DDLs
texera build            Re-build staged backend binaries (after backend
                        code changes)
texera start [mode]     Start services. Interactive menu if no mode given.
texera stop             Stop everything started by `texera start`.
texera status           Per-service port reachability table.
texera logs <service>   Tail one service's log file.
```
### `texera setup`
Idempotent first-time bootstrap. Verifies the toolchain (java 17, sbt, node
24, yarn, docker, pg_isready, psql, curl, unzip), runs `bin/build-services.sh`,
installs frontend (`yarn install`) and agent-service (`bun install`) deps,
applies `sql/texera_ddl.sql` and `sql/iceberg_postgres_catalog.sql`. Skips
agent-service gracefully if `bun` isn't installed.
### `texera build`
Delegates to `bin/build-services.sh` (`sbt clean dist` + unzip each
service's stage). Same path the deploy scripts use.
### `texera start`
Five modes, chosen by argument or interactive menu:
| Mode | Postgres + LakeFS/MinIO | Backend JVM services + agent | Frontend |
|---|---|---|---|
| `full` | ✓ | ✓ | ✓ |
| `backend` | ✓ | ✓ | — |
| `frontend` | — | — | ✓ |
| `infra` | ✓ | — | — |
| `services` | — | ✓ | ✓ |
The interactive menu (`texera start` with no arg, TTY only) renders a
box-drawn numbered prompt; `q` quits. If stdin is not a TTY and no mode is
given, the command exits with an error listing the valid modes, so it is safe
to call from CI.
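The mode table and the non-TTY error path could be sketched roughly as below. This is a minimal illustration, not the script's actual code: `mode_components`, `resolve_mode`, and `VALID_MODES` are hypothetical names, and the box-drawn menu is replaced by a plain `read` prompt.

```bash
# Hypothetical sketch: map a mode name to the component groups it starts.
# Output is three flags in the order infra, backend, frontend (1 = start).
mode_components() {
  case "$1" in
    full)     echo "1 1 1" ;;
    backend)  echo "1 1 0" ;;
    frontend) echo "0 0 1" ;;
    infra)    echo "1 0 0" ;;
    services) echo "0 1 1" ;;
    *)        return 1 ;;
  esac
}

VALID_MODES="full backend frontend infra services"

# Resolve the mode from an argument. With no argument and no TTY on stdin,
# fail with the list of valid modes instead of blocking on a prompt.
resolve_mode() {
  local mode="${1:-}"
  if [ -z "$mode" ]; then
    if [ -t 0 ]; then
      # Stand-in for the interactive menu (omitted in this sketch).
      read -r -p "mode [$VALID_MODES]: " mode
    else
      echo "error: no mode given; valid modes: $VALID_MODES" >&2
      return 2
    fi
  fi
  mode_components "$mode" >/dev/null || {
    echo "error: unknown mode '$mode'; valid modes: $VALID_MODES" >&2
    return 2
  }
  echo "$mode"
}
```

Keeping the flags in one `case` statement means a new mode is one extra line, in the same spirit as the service registry.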
Service registry is a single declarative table inside the script:
```bash
SERVICES=(
"config|.|target/config-service-*/bin/config-service|9094"
"compile|.|target/workflow-compiling-service-*/bin/workflow-compiling-service|9090"
"file|.|target/file-service-*/bin/file-service|9092"
"managing|.|target/computing-unit-managing-service-*/bin/computing-unit-managing-service|8888"
"access|.|target/access-control-service-*/bin/access-control-service|9096"
"master|amber|target/amber-*/bin/computing-unit-master|8085"
"worker|amber|target/amber-*/bin/computing-unit-worker|-"
"web|amber|target/amber-*/bin/texera-web-application|8080"
)
```
Adding a service later means adding one row.
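For readers unfamiliar with pipe-delimited registries in Bash, here is one way a row could be split into fields. The field names (`SVC_NAME`, `SVC_DIR`, `SVC_BIN_GLOB`, `SVC_PORT`) and both helper functions are assumptions of this sketch; the column meanings are inferred from the rows above, not confirmed by the script.

```bash
# Two rows copied from the registry above; the rest are omitted for brevity.
SERVICES=(
  "config|.|target/config-service-*/bin/config-service|9094"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
)

# Split one row into four fields. The here-string is not glob-expanded,
# so the `*` in the binary path survives untouched.
parse_service_row() {
  IFS='|' read -r SVC_NAME SVC_DIR SVC_BIN_GLOB SVC_PORT <<<"$1"
}

# Print "name port" for every service that has a port to probe.
# A port of "-" (the worker row) is treated as "nothing to probe".
list_probeable() {
  local row
  for row in "${SERVICES[@]}"; do
    parse_service_row "$row"
    [ "$SVC_PORT" = "-" ] && continue
    echo "$SVC_NAME $SVC_PORT"
  done
}
```

Plain indexed arrays and here-strings both work on bash 3.2, which matters because that is the macOS default shell version named in the test plan.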
Each row spawns from its sbt-native-packager staged binary (not `sbt runMain`).
That avoids the sbt boot-lock contention you get from launching several `sbt`
processes in parallel and skips sbt startup overhead per service.
Each service's stdout/stderr is piped through a colored prefixer:
```
[config] INFO ConfigService starting…
[compile] INFO WorkflowCompilingService starting…
[file] ERROR Failed to connect to lake fs server: …
[master] [INFO] [ClusterListener] received member event = MemberUp(...)
```
The color is a stable hash of the service name mapped onto an ANSI palette.
The stream prefixer is `awk -v p="$prefix" '{ print p, $0; fflush(); }'`.
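A stable name-to-color mapping plus the prefixer could look roughly like this. The palette values and the use of `cksum` as the hash are assumptions of this sketch (the PR does not say which hash it uses); only the `awk` one-liner comes from the description above.

```bash
# Hypothetical palette of ANSI foreground color codes.
PALETTE=(31 32 33 34 35 36)

# Map a service name to a palette entry. cksum (a POSIX CRC) stands in for
# whatever hash the real script uses; any deterministic hash would do.
color_for() {
  local sum
  sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo "${PALETTE[$(( sum % ${#PALETTE[@]} ))]}"
}

# Prefix every line of stdin. fflush() pushes each line out immediately,
# so output is not held in awk's buffer while the service keeps running.
prefix_stream() {
  local prefix="$1"
  awk -v p="$prefix" '{ print p, $0; fflush(); }'
}
```

A colored prefix could then be built with something like `printf '\033[%sm[%s]\033[0m' "$(color_for file)" file` and passed to `prefix_stream`.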
Per-service logs are also written to `logs/texera-dev/<name>.log` (so
`texera logs <name>` works mid-run).
### `texera stop`
`stop` kills every service launched by `texera start`, then runs
`docker compose down` on the LakeFS/MinIO stack.
The kill path matters because the previous scripts left orphan JVMs — see
the *Hard problems* section below.
### `texera status` and `texera logs`
`status` makes one `curl /api/healthcheck` per service and renders an
aligned table (`up`/`down`). Independent of any active `texera start`. Useful
for checking dev state from a different shell.
`logs <name>` is `tail -F logs/texera-dev/<name>.log`. Names come from the
same registry.
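A per-service probe table along these lines might be written as follows. `probe_port` is a name from the PR's file-by-file list, but its body here is a guess; `render_status`, the `PROBE` override hook, the `localhost` host, and the 2-second timeout are assumptions of this sketch.

```bash
# Probe one port's health endpoint. With -f, curl exits non-zero on
# HTTP 4xx/5xx as well as on connection failure.
probe_port() {
  curl -fsS --max-time 2 -o /dev/null \
    "http://localhost:$1/api/healthcheck" 2>/dev/null
}

# Render "name port state" rows with fixed-width columns.
# Arguments are "name:port" pairs; PROBE can be overridden for testing.
render_status() {
  local pair name port state
  for pair in "$@"; do
    name="${pair%%:*}"
    port="${pair##*:}"
    if "${PROBE:-probe_port}" "$port"; then state="up"; else state="down"; fi
    printf '%-10s %-6s %s\n' "$name" "$port" "$state"
  done
}
```

Called as `render_status config:9094 file:9092`, it prints one aligned row per service without depending on any pidfile or session state.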
## Hard problems and how they're solved
### 1. Per-service liveness while logs scroll past
Spawning every backend JVM service into one terminal means thousands of lines
of log spam during a normal boot. The user can't tell from the stream which
services are up.
**Solution: persistent bottom-pinned status bar.** When stdout is a TTY,
`status_bar_init` sets the terminal scroll region via DECSTBM
(`ESC[1;<LINES-3>r`), reserving the bottom 3 rows. A background poller redraws
those rows every 2 s:
```
═════════════════════════════════════════════════════════════
✓ ALL 9 SERVICES UP (47s elapsed)
═════════════════════════════════════════════════════════════
```
or on failure:
```
═════════════════════════════════════════════════════════════
✗ 2/9 DOWN: master✗ file… (12s)
═════════════════════════════════════════════════════════════
```
Symbols: `✗` = pipeline collapsed (JVM exited), `…` = process alive but port
not yet bound.
The whole 3-row redraw is one `printf` with `DECSC`/`DECRC` (save/restore
cursor) around it, so concurrent log writes from the spawned services and the
poller are unlikely to interleave at the byte level. The worst case is one
garbled frame, which self-heals on the next 2 s tick.
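To make the escape sequences concrete, here is a sketch of how the scroll region and a single-write frame could be assembled. `status_bar_region`, `row_at`, and `status_bar_frame` are hypothetical helper names; the sequences themselves are the standard DECSTBM (`ESC [ top ; bottom r`), DECSC (`ESC 7`), DECRC (`ESC 8`), cursor position (`ESC [ row ; col H`), and erase line (`ESC [ 2 K`).

```bash
ESC=$(printf '\033')

# Reserve the bottom 3 rows: the scroll region becomes rows 1..(rows-3).
status_bar_region() {
  printf '%s[1;%dr' "$ESC" "$(( $1 - 3 ))"
}

# Move to column 1 of the given row, erase that line, then write the text.
row_at() {
  printf '%s[%d;1H%s[2K%s' "$ESC" "$1" "$ESC" "$2"
}

# Assemble the full frame in a variable, then emit it with one printf:
# save cursor, paint the three reserved rows, restore cursor.
status_bar_frame() {
  local rows="$1" text="$2" rule="$3" frame
  frame="${ESC}7"
  frame="${frame}$(row_at "$(( rows - 2 ))" "$rule")"
  frame="${frame}$(row_at "$(( rows - 1 ))" "$text")"
  frame="${frame}$(row_at "$rows" "$rule")"
  frame="${frame}${ESC}8"
  printf '%s' "$frame"
}
```

On a 24-row terminal, `status_bar_region 24` limits scrolling to rows 1 through 21, and `status_bar_frame 24 "✓ ALL 9 SERVICES UP" "═════"` paints rows 22 through 24 before putting the cursor back where the log stream left it.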
When stdout *isn't* a TTY (CI, `| tee log.txt`, etc), `status_bar_supported`
returns false and the code falls back to a one-shot wait + trailing banner.
Teardown lives in two places: the `shutdown` handler (Ctrl+C trap) calls
`status_bar_teardown` before printing anything else, so "shutting down…" lands
in the normal layout, and `trap status_bar_teardown EXIT` is a
belt-and-suspenders safety net so the terminal is never left with a stuck
scroll region, even on an unexpected exit.
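The TTY check and the two teardown paths might fit together as sketched below. The function names match those mentioned above, but the bodies, the `STATUS_BAR_ACTIVE` guard, and passing the row count as an argument are assumptions; `ESC [ r` with no parameters is the standard way to reset the scroll region to the full screen.

```bash
ESC=$(printf '\033')
STATUS_BAR_ACTIVE=0

# The live bar is only drawn when stdout is a terminal.
status_bar_supported() { [ -t 1 ]; }

# Set the scroll region and arm the EXIT safety net.
# Returns 1 without side effects when stdout is not a TTY.
status_bar_init() {
  local rows="$1"
  status_bar_supported || return 1
  printf '%s[1;%dr' "$ESC" "$(( rows - 3 ))"
  STATUS_BAR_ACTIVE=1
  trap status_bar_teardown EXIT
}

# Reset the scroll region. The guard makes a second call a no-op, so the
# Ctrl+C handler and the EXIT trap can both call it safely.
status_bar_teardown() {
  [ "$STATUS_BAR_ACTIVE" = 1 ] || return 0
  STATUS_BAR_ACTIVE=0
  printf '%s[r' "$ESC"
}

# Ctrl+C path: restore the terminal first, then print in normal layout.
shutdown() {
  status_bar_teardown
  echo "shutting down…"
}
```

The guard is what lets both paths coexist: whichever runs first resets the terminal, and the other does nothing.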
### 2. file-service vs LakeFS startup race
`file-service` calls `LakeFSStorageClient.healthCheck()` during boot
(`file-service/src/main/scala/.../FileService.scala:77`). If LakeFS isn't
accepting HTTP, the JVM exits.
`docker compose up -d` returns when the container is up, not when LakeFS's
HTTP server is accepting connections — a 5–15 s gap. So file-service crashed
~50% of the time on cold starts.
**Solution:** `start_lakefs` now polls `http://localhost:8000/_health`
(falling back to `/`) for up to 60 s after `docker compose up -d`, and only
returns once LakeFS answers. Both endpoints verified to return 200 against the
running container.
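The readiness poll could be sketched as a small generic helper. `wait_for_http` and `http_ok` are hypothetical names, and the `HTTP_CHECK` and `POLL_INTERVAL` hooks exist only so the loop can be exercised without a running LakeFS; the real `start_lakefs` may be structured differently.

```bash
# One HTTP probe; curl -f treats 4xx/5xx as failure.
http_ok() { curl -fsS --max-time 2 -o /dev/null "$1" 2>/dev/null; }

# Poll until any of the given URLs answers, or the deadline passes.
# Usage: wait_for_http <timeout_seconds> <url>...
wait_for_http() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout )) url
  while :; do
    # Try each URL in order, so a fallback endpoint is only hit
    # when the primary one fails.
    for url in "$@"; do
      "${HTTP_CHECK:-http_ok}" "$url" && return 0
    done
    [ "$SECONDS" -ge "$deadline" ] && return 1
    sleep "${POLL_INTERVAL:-1}"
  done
}
```

With the endpoints named above, the call would be `wait_for_http 60 http://localhost:8000/_health http://localhost:8000/`, run right after `docker compose up -d` and before any JVM is spawned.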
### 3. Orphan JVMs holding ports after stop
The previous `bin/*-service.sh` launchers and earlier iterations of
`bin/texera` recorded the wrong PID. For the backgrounded pipeline
`( exec binary ) | tee log | prefix_stream`, `$!` is the PID of the last stage
(awk), not the JVM. Killing awk does not propagate to the JVM, which is a
sibling, not a child. The fallback `pkill -f <basename>` didn't help either,
because the launcher script's filename (`computing-unit-master` etc.) isn't in
the JVM's command line after `exec java -cp …`.
Result: every `texera start` after the first failed with
`BindException: 127.0.0.1:2552 Address already in use`, and you'd have to run
`lsof -ti :2552 | xargs kill -9` manually.
**Solution: process groups.**
- Each `spawn_*` now toggles `set -m` briefly so the backgrounded pipeline
gets its own process group. With job control on, the PGID equals the PID of the
pipeline leader, which is the JVM subshell.
- `pgid_of_pipeline` reads it via `ps -o pgid= -p $!` and stores it in the
pidfile (so the pidfile effectively holds the JVM's PID, not awk's).
- `kill_all_pgids <grace>` does `kill -TERM -- -PGID` per recorded group, which
SIGTERMs the JVM, tee, and awk together. It sleeps `grace` seconds, then runs
`kill -KILL -- -PGID` on any group still alive. Used by both `shutdown`
(Ctrl+C, 2 s grace) and `stop` (subcommand, 3 s grace).
- For JVMs left over from before this PR existed (no pidfile to consult),
`stop` also `pkill -f <mainclass>`s each known Java mainclass:
- `org.apache.texera.web.{ComputingUnitMaster, ComputingUnitWorker,
TexeraWebApplication}`
- `org.apache.texera.service.{ConfigService, FileService,
AccessControlService, ComputingUnitManagingService, WorkflowCompilingService}`
- List verified against `META-INF/MANIFEST.MF` in every built jar and the
`app_mainclass=` declarations in the amber launcher scripts.
Free side benefit: `is_spawn_alive` now `kill -0 PGID`s, which directly
probes the JVM leader rather than using awk's liveness as a proxy. The status
bar's "crashed" detection is precise.
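The process-group technique can be tried in isolation with a stand-in pipeline. In this sketch `spawn_in_group`, `kill_group`, and `SPAWN_PGID` are hypothetical names, `cat` stands in for `tee` plus the awk prefixer, and the result is stored in a variable rather than a pidfile.

```bash
# Start a pipeline in its own process group and record the group's PGID.
spawn_in_group() {
  set -m                       # job control on: the job gets its own group
  ( exec "$@" ) 2>&1 | cat >/dev/null &
  set +m
  # $! is the last pipeline member (cat). Its PGID equals the PID of the
  # pipeline leader, which is the exec'd command.
  SPAWN_PGID=$(ps -o pgid= -p "$!" | tr -d ' ')
}

# TERM the whole group, wait, then KILL whatever is left.
kill_group() {
  local pgid="$1" grace="$2"
  kill -TERM -- "-$pgid" 2>/dev/null || return 0   # group already gone
  sleep "$grace"
  kill -KILL -- "-$pgid" 2>/dev/null || true
}
```

After `spawn_in_group sleep 30`, `kill -0 "$SPAWN_PGID"` probes the leader directly, and `kill_group "$SPAWN_PGID" 2` takes down every stage of the pipeline with one signal.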
### 4. Ordering constraints
`ComputingUnitMaster` must bind its Pekko/Akka cluster port before
`computing-unit-worker` tries to join. Encoded as one `sleep 4` after spawning
the master row. The launch loop walks `SERVICES` in declaration order, so the
table itself is the canonical ordering.
LakeFS comes before all JVM spawns because file-service depends on it;
Postgres comes before LakeFS because LakeFS uses it.
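A launch loop that treats the table as the ordering could look roughly like this. The `SPAWN` and `MASTER_DELAY` hooks are invented for the sketch so it can run without real binaries; the PR describes a fixed `sleep 4`.

```bash
# Abbreviated registry: declaration order is launch order.
SERVICES=(
  "file|.|target/file-service-*/bin/file-service|9092"
  "master|amber|target/amber-*/bin/computing-unit-master|8085"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
)

# Walk the table top to bottom. Pause after the master row so its cluster
# port is bound before the worker row is spawned.
launch_all() {
  local row name
  for row in "${SERVICES[@]}"; do
    name="${row%%|*}"            # first field of the row
    "${SPAWN:-echo}" "$name"     # default just prints the name
    if [ "$name" = "master" ]; then
      sleep "${MASTER_DELAY:-4}"
    fi
  done
}
```

Because the loop never sorts or filters by name, moving a row in `SERVICES` is the only way to change when a service starts.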
## File-by-file
- **`bin/texera`** — entire orchestrator. Sections: service registry, mode
table, color/printing, tool checks, infra (`ensure_postgres`, `start_lakefs`,
`stop_lakefs`), stream prefixer, spawns, `kill_all_pgids` + `shutdown` trap,
`status`/`logs` subcommands, `setup`/`build`, mode lookup + interactive menu,
readiness probes (`probe_port`, `is_spawn_alive`, `wait_for_services`,
`print_readiness_banner`), status bar, `start`, `stop`, dispatch.
- **`bin/check-services.sh`** — standalone one-shot probe of every service's
HTTP port. Independent of `texera start` session state. Prints a per-service
table + a green/red trailing banner, exits non-zero on any failure. Useful from
a second shell or in CI.
- **`bin/build-services.sh`** — minor: adds the `access-control-service`
unzip step that was missing, and renames the amber zip target from
`texera-*.zip` to `amber-*.zip` to match the new artifact name.
## What's intentionally not in scope
- The IntelliJ `.run/*.xml` configs still work; they're the path for
breakpoint debugging. `texera start` is for "I want everything running, fast."
- The `bin/single-node` Docker Compose deploy isn't touched.
- No CI hookup added. `texera start backend` works in non-TTY mode (banner
fallback), but no GitHub Actions job exercises it.
## Configuration knobs
- `TEXERA_READY_TIMEOUT` (default 90) — seconds the one-shot non-TTY
readiness check waits before giving up. The persistent bar polls forever; this
only applies to the fallback path.
- `TEXERA_HOST` (default `localhost`, used by `check-services.sh`) — host to
probe from.
- `TEXERA_PROBE_TIMEOUT` (default 2, used by `check-services.sh`) —
per-probe curl timeout.
LakeFS-ready timeout in `start_lakefs` is currently hard-coded at 60 s;
making it env-configurable is a small follow-up.
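Environment knobs of this kind are usually read with Bash's `${VAR:-default}` expansion. The sketch below shows that pattern for the three variables listed above; `load_knobs` and `is_uint` are hypothetical helpers, and the validation step is an addition of this sketch rather than something the PR describes.

```bash
# Apply defaults only when a variable is unset or empty.
load_knobs() {
  TEXERA_READY_TIMEOUT="${TEXERA_READY_TIMEOUT:-90}"
  TEXERA_HOST="${TEXERA_HOST:-localhost}"
  TEXERA_PROBE_TIMEOUT="${TEXERA_PROBE_TIMEOUT:-2}"
}

# True when the argument is a non-empty string of digits.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *)           return 0 ;;
  esac
}

# Reject non-numeric timeouts early instead of failing inside a probe.
validate_knobs() {
  is_uint "$TEXERA_READY_TIMEOUT" && is_uint "$TEXERA_PROBE_TIMEOUT"
}
```

A one-off override then needs no edits to the script, for example `TEXERA_HOST=devbox TEXERA_PROBE_TIMEOUT=5 bin/check-services.sh`.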
## Test plan
Verified locally (macOS, bash 3.2):
- [x] `texera setup` from a clean checkout, then `texera start` → menu
→ mode 1 → all 9 services come up → bar flips green → frontend loads at `:4200`.
- [x] Ctrl+C while running → bar disappears, scroll region restored, JVMs
all exit within a couple seconds.
- [x] Immediate `texera start full` again → no port conflicts (the previous
orphan-JVM regression is gone).
- [x] `texera stop` from a separate shell while a `start` is running → both
terminals come back clean.
- [x] Kill `file-service` mid-run via `pkill -f FileService` → bar flips to
`✗ 1/9 DOWN: file✗ (… elapsed)` within 2 s.
- [x] `texera start infra` → only Postgres + LakeFS/MinIO come up; script
exits cleanly without blocking on `wait`.
- [x] `texera status` from a second terminal during a healthy run → all up.
- [x] `texera logs file` → tails `logs/texera-dev/file.log`.
- [x] PGID/group kill round-trip verified with a synthetic
`sleep | cat | awk` pipeline (`ps -o pid,pgid,comm -g <pgid>` empty after one TERM).
- [x] LakeFS probe endpoints (`/_health`, `/`) both verified to return 200.
- [x] All Java mainclass names verified against built jar manifests and
amber launcher scripts.
Not yet verified (follow-ups, see below):
- Cross-terminal: only tested in macOS Terminal.app + tmux. Behavior in
iTerm2, VS Code's embedded terminal, IntelliJ console, screen, etc. unverified.
- Terminal resize during a run (SIGWINCH).
- Headless / `texera start backend | tee` path through CI.
## Known limitations
- **TTY only for the live bar.** Non-TTY runs fall back to a one-shot
banner. This is intentional but means `texera start full | tee session.log`
won't show the live view.
- **Concurrent log writes can occasionally corrupt one bar frame.**
Single-printf renders mitigate but don't fully eliminate byte-level
interleaving on shared stdout. Self-heals on the next refresh.
- **Mainclass pkill in `stop` is broad.** If you have another checkout of
this repo running, `texera stop` here will kill that one too. `pkill -u "$USER"`
would at least limit the match to your own processes on a shared machine,
though it cannot tell two checkouts under the same user apart; left as-is for
now since most devs run a single instance.
- **`set -m` semantics vary slightly across bash versions.** Verified on
macOS bash 3.2 and Linux bash 5.x. Shells other than bash aren't supported
(the shebang is `#!/usr/bin/env bash` anyway).
## Follow-ups
Tracking these separately, not blocking this PR:
1. `bin/README.md` section documenting subcommands, modes, env vars, and the
status bar.
2. Make the LakeFS readiness timeout env-configurable
(`TEXERA_LAKEFS_TIMEOUT`).
3. `pkill -u "$USER"` on the orphan-mainclass fallback.
4. CI smoke job: `texera start backend` headless, assert exit code on
readiness.
5. `AGENTS.md` mention so subagents prefer `texera start` over the
`bin/*-service.sh` set when bringing the stack up.
## Migration notes
For existing contributors: nothing breaks. The old `bin/*-service.sh`
scripts, IntelliJ `.run/*.xml` configs, and `bin/single-node` deploy are
untouched and continue to work. `texera start` is opt-in.
The first time you use it, run `texera setup` once, then `texera start`. If
you've ever Ctrl+C'd one of the old scripts and left an orphan JVM, run
`texera stop` first; its mainclass fallback will clean those up.