aicam opened a new pull request, #5086:
URL: https://github.com/apache/texera/pull/5086
> **Heads-up for reviewers:** this is a hackathon-prototype branch and is
**not intended to merge as-is**. It is being opened to share the work and
invite feedback on the design. The branch contains a few things that would need
to be split out before any landing attempt (local-dev `gui.conf`/`llm.conf`
overrides, unrelated frontend ML-model experiments, k8s overrides, etc.). The
core feature — a new \`MachineUDF\` operator + a per-host \`machine-manager\`
service — is what we'd like opinions on.
## TL;DR
A new Texera operator, **\`MachineUDF\`**, that runs user Python on a
*user-registered remote machine* (or just the user's own laptop) rather than on
a Texera computing unit. The machine runs a tiny **\`machine-manager\`** HTTP
service we ship; the operator POSTs each tuple — or, in batch mode, the entire
upstream table — to that service, gets back the result(s), and emits them as
output tuples just like \`PythonUDFV2\` would. Datasets in LakeFS now also
accept a literal \`latest\` segment in the file path so workflows don't break
every time a new dataset version lands.
Use cases this unlocks:
- **Data already on the user's laptop?** Read it without uploading via S3
first.
- **Results need to land on the user's laptop?** Plots, reports, fine-tuned
model files, JSON output — written directly to a path on the user's machine.
- **Per-tuple side effects on a known host** (e.g. write each row as a JSONL
file).
- **Whole-table analytics that need a Python ML stack** the computing units
don't necessarily have (sklearn, matplotlib, ...).
End-to-end demo that shipped on this branch: an LLM agent reads
\`/home/ali/UCI/hackathon/diabetes.csv\` on the laptop, builds a workflow on
the canvas (\`CSVFileScan → MachineUDF[batch]\`), trains LinearRegression /
Ridge / SVR predicting \`target\`, saves three prediction-vs-actual PNGs and a
\`report.md\` back into \`/home/ali/UCI/hackathon/\`, and surfaces the metrics
in the Texera result table.
## Architecture
```
┌──────────────┐ ┌─────────────────────────────┐
│ Texera UI │ HTTP │ Texera backend services │
│ (Angular) │ ──────► │ (amber, file-service, …) │
└──────┬───────┘ └──────────────┬──────────────┘
│ JWT │ JDBC
│ ▼
│ ┌───────────────────┐
│ │ Postgres │
│ │ + `machine` tbl │
│ └───────────────────┘
│
│ CRUD over /api/machines (new JAX-RS resource)
│
┌──────▼───────────────────────────────────────────┐
│ ComputingUnitMaster │
│ • MachineUDFOpExec (new) │
│ ── HTTP/1.1 POST /python ──┐ │
└────────────────────────────────│─────────────────┘
│
▼
┌─────────────────────────────────────┐
│ machine-manager (FastAPI) │
│ /healthz │
│ /exec (bash, allowlist) │
│ /python (run script) │
│ /deploy-code (sandboxed write) │
│ /upload-to-dataset │
└─────────────────────────────────────┘
runs on the user's own laptop
(or any registered host)
```
The user registers a machine (URL + optional bearer token) via the new
**Machines** dashboard tab; the operator looks up the URL from the \`machine\`
table at runtime.
## What's in the PR
### 1. \`MachineUDF\` operator (Scala) —
\`common/workflow-operator/.../udf/machine/\`
- \`MachineUDFOpDesc\` — operator descriptor with properties \`machineUrl\`,
\`machineToken\`, \`code\`, \`timeoutSeconds\`, \`batchMode\`,
\`retainInputColumns\`, \`outputColumns\`.
- \`MachineUDFOpExec\` — extends \`OperatorExecutor\` directly (not
\`MapOpExec\`) so it can run **either** per-tuple **or** in **batch mode**.
- **Per-tuple mode** (default): one \`POST /python\` per row, \`tuple_in\`
is a single dict; last JSON line on stdout becomes the output tuple.
- **Batch mode** (\`batchMode: true\`): buffer all tuples; on
\`onFinish\`, ONE \`/python\` call with \`tuple_in\` as a list of dicts; each
JSON object line on stdout becomes one output row, projected onto the declared
\`outputColumns\`.
- HTTP client pinned to **HTTP/1.1** — we hit a real bug where the JDK's
default \`HttpClient\` does h2c upgrade and uvicorn drops the request body. The
pinning eliminates that whole class of "POST sent but server received empty
body" failures.
- Output schema is preserved tuple-by-tuple in per-tuple mode (re-uses the
original tuple's typed values for retained columns so e.g. Int stays Int
instead of becoming Long after a JSON round-trip).
Registered in \`LogicalOp.scala\` as the \`MachineUDF\` JSON-type, in the
\`PYTHON_GROUP\` operator group.
### 2. \`machine-manager\` Python service — \`machine-manager/\` (new
directory)
FastAPI service the operator POSTs to. Endpoints:
| Endpoint | Purpose
|
| ------------------------ |
------------------------------------------------------------------- |
| \`GET /healthz\` | Liveness + reports which Python interpreter
\`/python\` uses |
| \`POST /exec\` | Run a shell command (\`bash -c\`), capture
stdout/stderr/exit_code |
| \`POST /python\` | Run a Python script with \`tuple_in\`
injected as a global |
| \`POST /deploy-code\` | Write a file under a sandbox dir (path-escape
protected) |
| \`POST /upload-to-dataset\`| Read a local file and push it into a Texera
dataset (creating a new version) |
Highlights:
- **Token-optional bearer auth** via \`MACHINE_MANAGER_TOKEN\`. Skipped in
dev when unset.
- **Configurable interpreter** for \`/python\` via
\`MACHINE_MANAGER_PYTHON\`. Defaults to whatever Python the service itself runs
on, but the included \`bin/run.sh\` autodetects a Texera data-science venv
(sklearn/pandas/matplotlib/numpy) and picks that so analysis scripts work out
of the box.
- **Pre-flight syntax check** on every \`/python\` call: we \`compile()\`
the user script before spawning a subprocess. Caller gets \`exit_code=2\` + a
one-line hint (about raw newlines in string literals being the typical cause).
Saves a real round-trip when LLM-generated scripts are malformed.
- \`/upload-to-dataset\` handles file-service's "No changes detected" 400 as
a no-op and falls back to the latest version metadata, so re-uploads of
identical content don't error.
- \`tuple_in\` schema is \`Any\` — dict for per-tuple mode, list-of-dicts
for batch mode, \`null\` for one-shot diagnostic calls.
### 3. Machine CRUD — \`amber/.../resource/MachineResource.scala\` +
\`sql/texera_ddl.sql\` + frontend
- New \`machine\` table (\`mid, uid, name, url, token, creation_time\`) —
owner-scoped, JWT-gated.
- New JAX-RS resource at \`/api/machines\` (list/create/update/delete).
- New **Machines** dashboard tab — standalone Angular component with
\`NzTable\` and an add/edit modal — lives next to Datasets in the sidebar.
- Service + types under \`frontend/src/app/common/service/machine/\` and
\`frontend/src/app/common/type/machine.ts\`.
### 4. \`FileResolver\` — \`latest\` version sentinel
\`common/workflow-core/.../FileResolver.scala\` now accepts the literal
segment \`latest\` in dataset paths:
```
/<ownerEmail>/<datasetName>/latest/<filePath> ← always resolves to the
newest version
/<ownerEmail>/<datasetName>/v3/<filePath> ← still works, explicit
version
/<ownerEmail>/<datasetName>/<filePath> ← 3-segment form; latest
implied
```
The "latest" form is resolved at execution time by querying
\`dataset_version\` ordered by \`dvid DESC LIMIT 1\`. This is the path
\`uploadFileToDataset\` returns to callers, which means once an LLM (or a
human) wires a scan operator with a \`latest\` path, subsequent uploads to that
dataset don't break the workflow.
### 5. Agent-service integration
The \`agent-service\` (LLM-driven workflow builder) has been taught about
all of the above:
- New tools: \`runOnMachine\` (\`/exec\`), \`runPythonOnMachine\`
(\`/python\`, currently de-registered to keep workflows as the only answer
path), \`listDatasets\`, \`uploadFileToDataset\`, \`getDatasetFile\`. The
upload tool returns a verbatim \`fileName_for_scan_operator\` string so the LLM
doesn't have to construct the canonical dataset path itself.
- Prompt rules added to prevent the common LLM failure modes we observed:
guessing \`datasetId\` from a name, fabricating \`machineUrl\` hostnames,
ignoring tool result paths, embedding raw newlines in single-/double-quoted
Python strings (including f-strings), forgetting \`batchMode: true\` on
whole-table MachineUDF scripts, re-running a workflow that already returned
\`COMPLETED\`.
- \`MachineUDF\`, the relevant Sklearn operators
(\`SklearnTrainingLinearRegression\`, \`SklearnTrainingRidge\`, \`SVRTrainer\`,
\`SklearnPrediction\`, \`Split\`), and \`PythonTableReducer\` are added to
\`DEFAULT_AGENT_SETTINGS.allowedOperatorTypes\` so the agent can actually pick
them.
A worked example for the diabetes regression demo is embedded in the prompt
verbatim, including a per-branch script template.
## How it was tested
End-to-end demo, repeatedly: agent prompt → \`runOnMachine\` to verify the
local CSV → \`uploadFileToDataset\` → \`addOperator\` \`CSVFileScan\` + batch
\`MachineUDF\` → workflow execution. Result table populated with 3 rows
(LinearRegression / Ridge / SVR metrics), 3 PNGs + \`report.md\` written to the
laptop. Example output:
| Model | R² | MSE |
|---|---|---|
| LinearRegression | 0.4753 | 3375.74 |
| Ridge | 0.4838 | 3320.47 |
| SVR | -0.2073 | 7766.64 |
Failure modes we encountered and have either fixed or surfaced clearly:
- JDK \`HttpClient\` h2c upgrade dropping the POST body → pin HTTP/1.1.
- \`MachineUDF\` in batch mode being treated as per-tuple because the agent
forgot \`batchMode: true\` → loud prompt rule + clear malformed-output
signature.
- LLM-generated Python scripts with raw newlines inside f-strings →
server-side \`compile()\` pre-flight + explicit prompt rule.
- LLM hallucinating \`http://<machine-name>:5555\` instead of using the URL
returned by \`runOnMachine\` → hard prompt rule.
## Known limitations / things to discuss
- The Apache license header is missing on the new \`machine-manager/\`
Python files. Easy to add before any cleanup pass.
- \`machine-manager\` has no rate limiting / process isolation beyond what
FastAPI gives you. The \`/exec\` endpoint is a literal "run this bash command
on the user's host" — useful for inspection from the canvas, but the threat
model has to be "the user already trusts this Texera instance with their
laptop".
- \`MachineUDF\` operator has only one input port. Three-branch ML demos
that want to fan three SklearnPrediction outputs into one collector currently
need a separate \`MachineUDF\` per branch (or to standardize column names and
Union upstream). A multi-port variant would be a nice follow-up.
- The frontend changes include some sibling work (ML-model component, MLFlow
operator schema) that would need to be split out for a real merge.
- The agent-service prompt is currently authored for a specific demo flow
(regression on diabetes.csv). A real merge would generalize the worked example.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]