aicam opened a new pull request, #5086:
URL: https://github.com/apache/texera/pull/5086

   > **Heads-up for reviewers:** this is a hackathon-prototype branch and is 
**not intended to merge as-is**. It is being opened to share the work and 
invite feedback on the design. The branch contains a few things that would need 
to be split out before any landing attempt (local-dev `gui.conf`/`llm.conf` 
overrides, unrelated frontend ML-model experiments, k8s overrides, etc.). The 
core feature — a new \`MachineUDF\` operator + a per-host \`machine-manager\` 
service — is what we'd like opinions on.
   
   ## TL;DR
   
   A new Texera operator, **\`MachineUDF\`**, that runs user Python on a 
*user-registered remote machine* (or just the user's own laptop) rather than on 
a Texera computing unit. The machine runs a tiny **\`machine-manager\`** HTTP 
service we ship; the operator POSTs each tuple — or, in batch mode, the entire 
upstream table — to that service, gets back the result(s), and emits them as 
output tuples just like \`PythonUDFV2\` would. Datasets in LakeFS now also 
accept a literal \`latest\` segment in the file path so workflows don't break 
every time a new dataset version lands.
   
   Use cases this unlocks:
   
   - **Data already on the user's laptop?** Read it without uploading via S3 
first.
   - **Results need to land on the user's laptop?** Plots, reports, fine-tuned 
model files, JSON output — written directly to a path on the user's machine.
   - **Per-tuple side effects on a known host** (e.g. write each row as a JSONL 
file).
   - **Whole-table analytics that need a Python ML stack** the computing units 
don't necessarily have (sklearn, matplotlib, ...).
   
   End-to-end demo that shipped on this branch: an LLM agent reads 
\`/home/ali/UCI/hackathon/diabetes.csv\` on the laptop, builds a workflow on 
the canvas (\`CSVFileScan → MachineUDF[batch]\`), trains LinearRegression / 
Ridge / SVR predicting \`target\`, saves three prediction-vs-actual PNGs and a 
\`report.md\` back into \`/home/ali/UCI/hackathon/\`, and surfaces the metrics 
in the Texera result table.
   
   ## Architecture
   
   ```
     ┌──────────────┐         ┌─────────────────────────────┐
     │  Texera UI   │  HTTP   │  Texera backend services    │
     │  (Angular)   │ ──────► │  (amber, file-service, …)   │
     └──────┬───────┘         └──────────────┬──────────────┘
            │ JWT                            │ JDBC
            │                                ▼
            │                        ┌───────────────────┐
            │                        │   Postgres        │
            │                        │   + `machine` tbl │
            │                        └───────────────────┘
            │
            │  CRUD over /api/machines (new JAX-RS resource)
            │
     ┌──────▼───────────────────────────────────────────┐
     │  ComputingUnitMaster                             │
     │   • MachineUDFOpExec (new)                       │
     │     ── HTTP/1.1 POST /python ──┐                 │
     └────────────────────────────────│─────────────────┘
                                      │
                                      ▼
                   ┌─────────────────────────────────────┐
                   │   machine-manager  (FastAPI)        │
                   │     /healthz                        │
                   │     /exec        (bash, allowlist)  │
                   │     /python      (run script)       │
                   │     /deploy-code (sandboxed write)  │
                   │     /upload-to-dataset              │
                   └─────────────────────────────────────┘
                     runs on the user's own laptop
                     (or any registered host)
   ```
   
   The user registers a machine (URL + optional bearer token) via the new 
**Machines** dashboard tab; the operator looks up the URL from the \`machine\` 
table at runtime.
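
As a rough illustration of the runtime call, the sketch below builds (without sending) the kind of request the operator might issue once it has looked up a machine's URL and token. Only \`/python\`, \`tuple_in\`, and the optional bearer token come from this PR; the other body field names, the helper \`build_python_request\`, and the \`127.0.0.1:5555\` address are hypothetical.

```python
import json
import urllib.request


def build_python_request(machine_url, code, tuple_in, token=None, timeout_seconds=60):
    """Build (but do not send) a POST /python request for a registered machine."""
    # Assumption: apart from `tuple_in`, the body field names are illustrative only.
    body = json.dumps({
        "code": code,
        "tuple_in": tuple_in,
        "timeout_seconds": timeout_seconds,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:  # the bearer token is optional
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        machine_url.rstrip("/") + "/python",
        data=body,
        headers=headers,
        method="POST",
    )


# Hypothetical machine URL and token, for illustration only.
req = build_python_request(
    "http://127.0.0.1:5555", "print('{}')", {"age": 59}, token="secret"
)
req_no_auth = build_python_request("http://127.0.0.1:5555/", "print('{}')", None)
```

Passing \`None\` as \`tuple_in\` serializes to JSON \`null\`, which matches the one-shot diagnostic case described for the service.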
   
   ## What's in the PR
   
   ### 1. \`MachineUDF\` operator (Scala) — 
\`common/workflow-operator/.../udf/machine/\`
   - \`MachineUDFOpDesc\` — operator descriptor with properties \`machineUrl\`, 
\`machineToken\`, \`code\`, \`timeoutSeconds\`, \`batchMode\`, 
\`retainInputColumns\`, \`outputColumns\`.
   - \`MachineUDFOpExec\` — extends \`OperatorExecutor\` directly (not 
\`MapOpExec\`) so it can run **either** per-tuple **or** in **batch mode**.
     - **Per-tuple mode** (default): one \`POST /python\` per row, \`tuple_in\` 
is a single dict; last JSON line on stdout becomes the output tuple.
     - **Batch mode** (\`batchMode: true\`): buffer all tuples; on 
\`onFinish\`, ONE \`/python\` call with \`tuple_in\` as a list of dicts; each 
JSON object line on stdout becomes one output row, projected onto the declared 
\`outputColumns\`.
   - HTTP client pinned to **HTTP/1.1**: we hit a real bug where the JDK's 
default \`HttpClient\` attempts an h2c upgrade and uvicorn drops the request body. 
Pinning the version eliminates that whole class of "POST sent but server received 
empty body" failures.
   - Column types are preserved in per-tuple mode: retained columns reuse the 
original tuple's typed values, so e.g. an Int stays an Int instead of becoming a 
Long after a JSON round-trip.
   
   Registered in \`LogicalOp.scala\` as the \`MachineUDF\` JSON-type, in the 
\`PYTHON_GROUP\` operator group.
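
The two stdout conventions described above can be sketched as follows. This is a minimal Python illustration, not the Scala code in \`MachineUDFOpExec\`. The function name \`parse_stdout\`, the rule of skipping non-JSON lines, and filling missing columns with \`None\` are assumptions.

```python
import json


def parse_stdout(stdout, batch_mode=False, output_columns=None):
    """Turn a script's stdout into output rows (a list of dicts)."""
    rows = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # assumption: blank lines and log output are ignored
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: malformed JSON lines are ignored
        if isinstance(obj, dict):
            rows.append(obj)
    if not batch_mode:
        # Per-tuple mode: only the last JSON line becomes the output tuple.
        return rows[-1:]
    if output_columns is None:
        return rows
    # Batch mode: every JSON object line is a row, projected onto the
    # declared output columns (missing keys become None in this sketch).
    return [{col: row.get(col) for col in output_columns} for row in rows]


sample = 'loading data\n{"id": 1, "score": 0.5, "debug": true}\n{"id": 2, "score": 0.25}\n'
per_tuple = parse_stdout(sample)
batch = parse_stdout(sample, batch_mode=True, output_columns=["id", "score"])
```

With the sample above, per-tuple mode keeps only the second object, while batch mode yields two rows with the \`debug\` key dropped by the projection.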
   
   ### 2. \`machine-manager\` Python service — \`machine-manager/\` (new 
directory)
   
   FastAPI service the operator POSTs to. Endpoints:
   
   | Endpoint | Purpose |
   | --- | --- |
   | \`GET  /healthz\` | Liveness + reports which Python interpreter \`/python\` uses |
   | \`POST /exec\` | Run a shell command (\`bash -c\`), capture stdout/stderr/exit_code |
   | \`POST /python\` | Run a Python script with \`tuple_in\` injected as a global |
   | \`POST /deploy-code\` | Write a file under a sandbox dir (path-escape protected) |
   | \`POST /upload-to-dataset\` | Read a local file and push it into a Texera dataset (creating a new version) |
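
For readers unfamiliar with the "path-escape protected" wording on \`/deploy-code\`, one common way to implement such a check is sketched below. It is an assumption about the general technique, not the code in \`machine-manager/\`; the helper name \`resolve_in_sandbox\` and the temporary sandbox directory are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_in_sandbox(sandbox_dir, relative_path):
    """Return the absolute target path, or raise ValueError if it escapes the sandbox."""
    root = Path(sandbox_dir).resolve()
    # Joining an absolute path discards `root`, so "/etc/passwd" is caught
    # by the containment check below just like "../" traversal is.
    target = (root / relative_path).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes sandbox: {relative_path}")
    return target


# Hypothetical sandbox directory, created only for this illustration.
sandbox = tempfile.mkdtemp()
ok = resolve_in_sandbox(sandbox, "scripts/job.py")
```

Resolving both the root and the target before comparing them means that \`..\` segments and symlinks in existing path components are normalized before the containment test runs.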
   
   Highlights:
   - **Token-optional bearer auth** via \`MACHINE_MANAGER_TOKEN\`. Skipped in 
dev when unset.
   - **Configurable interpreter** for \`/python\` via 
\`MACHINE_MANAGER_PYTHON\`. Defaults to whatever Python the service itself runs 
on, but the included \`bin/run.sh\` autodetects a Texera data-science venv 
(sklearn/pandas/matplotlib/numpy) and picks that so analysis scripts work out 
of the box.
   - **Pre-flight syntax check** on every \`/python\` call: we \`compile()\` 
the user script before spawning a subprocess. If compilation fails, the caller 
gets \`exit_code=2\` plus a one-line hint (raw newlines in string literals are 
the typical cause). This saves a real round-trip when LLM-generated scripts are 
malformed.
   - \`/upload-to-dataset\` handles file-service's "No changes detected" 400 as 
a no-op and falls back to the latest version metadata, so re-uploads of 
identical content don't error.
   - \`tuple_in\` schema is \`Any\` — dict for per-tuple mode, list-of-dicts 
for batch mode, \`null\` for one-shot diagnostic calls.
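
The pre-flight check can be sketched with Python's built-in \`compile()\`. This is a minimal illustration under assumptions: \`exit_code=2\` comes from the description above, while the function name \`preflight\`, the \`stderr\` and \`hint\` field names, and the hint wording are hypothetical.

```python
def preflight(source, filename="<machine-udf>"):
    """Return None if the script compiles, else an error payload for the caller."""
    try:
        # Compile only; nothing is executed and no subprocess is spawned.
        compile(source, filename, "exec")
    except SyntaxError as err:
        return {
            "exit_code": 2,
            "stderr": f"SyntaxError at line {err.lineno}: {err.msg}",
            # Assumption: the hint text is illustrative, not the service's wording.
            "hint": "Check for raw newlines inside single- or double-quoted string literals.",
        }
    return None


good = preflight("x = 1\nprint(x)")
# A raw newline inside a double-quoted literal, the typical failure mode.
bad = preflight('print("hello\nworld")')
```

Because \`compile()\` only parses the source, a script that is syntactically valid but fails at runtime still passes this check and reaches the subprocess.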
   
   ### 3. Machine CRUD — \`amber/.../resource/MachineResource.scala\` + 
\`sql/texera_ddl.sql\` + frontend
   
   - New \`machine\` table (\`mid, uid, name, url, token, creation_time\`) — 
owner-scoped, JWT-gated.
   - New JAX-RS resource at \`/api/machines\` (list/create/update/delete).
   - New **Machines** dashboard tab — standalone Angular component with 
\`NzTable\` and an add/edit modal — lives next to Datasets in the sidebar.
   - Service + types under \`frontend/src/app/common/service/machine/\` and 
\`frontend/src/app/common/type/machine.ts\`.
   
   ### 4. \`FileResolver\` — \`latest\` version sentinel
   
   \`common/workflow-core/.../FileResolver.scala\` now accepts the literal 
segment \`latest\` in dataset paths:
   
   ```
   /<ownerEmail>/<datasetName>/latest/<filePath>      ← always resolves to the newest version
   /<ownerEmail>/<datasetName>/v3/<filePath>          ← still works, explicit version
   /<ownerEmail>/<datasetName>/<filePath>             ← 3-segment form; latest implied
   ```
   
   The "latest" form is resolved at execution time by querying 
\`dataset_version\` ordered by \`dvid DESC LIMIT 1\`. This is the path 
\`uploadFileToDataset\` returns to callers, which means once an LLM (or a 
human) wires a scan operator with a \`latest\` path, subsequent uploads to that 
dataset don't break the workflow.
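
The three path forms and the "newest version wins" rule can be sketched in Python as follows. This is not the Scala logic in \`FileResolver\`: the function names, the in-memory \`(dvid, name)\` rows standing in for \`dataset_version\`, and the rule that a version segment is either \`latest\` or matches \`v<digits>\` are all assumptions.

```python
import re

# Assumption: explicit version segments look like "v3".
VERSION_SEGMENT = re.compile(r"^v\d+$")


def split_dataset_path(path):
    """Split a dataset path into (owner, dataset, version, file_path)."""
    parts = path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"not a dataset file path: {path}")
    owner, dataset, rest = parts[0], parts[1], parts[2:]
    if len(rest) >= 2 and (rest[0] == "latest" or VERSION_SEGMENT.match(rest[0])):
        return owner, dataset, rest[0], "/".join(rest[1:])
    # 3-segment form: no version segment, so the latest version is implied.
    return owner, dataset, "latest", "/".join(rest)


def pick_version(version, versions):
    """Pick one (dvid, name) row; `versions` holds the rows of a single dataset."""
    if version == "latest":
        # Equivalent in spirit to ORDER BY dvid DESC LIMIT 1.
        return max(versions, key=lambda row: row[0])
    for row in versions:
        if row[1] == version:
            return row
    raise LookupError(f"unknown version: {version}")


rows = [(1, "v1"), (3, "v3"), (2, "v2")]
latest_row = pick_version("latest", rows)
```

One consequence of this sketch's parsing rule is that a top-level directory literally named \`latest\` or \`v3\` inside a dataset would be read as a version segment.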
   
   ### 5. Agent-service integration
   
   The \`agent-service\` (LLM-driven workflow builder) has been taught about 
all of the above:
   
   - New tools: \`runOnMachine\` (\`/exec\`), \`runPythonOnMachine\` 
(\`/python\`, currently de-registered to keep workflows as the only answer 
path), \`listDatasets\`, \`uploadFileToDataset\`, \`getDatasetFile\`. The 
upload tool returns a verbatim \`fileName_for_scan_operator\` string so the LLM 
doesn't have to construct the canonical dataset path itself.
   - Prompt rules added to prevent the common LLM failure modes we observed: 
guessing \`datasetId\` from a name, fabricating \`machineUrl\` hostnames, 
ignoring tool result paths, embedding raw newlines in single-/double-quoted 
Python strings (including f-strings), forgetting \`batchMode: true\` on 
whole-table MachineUDF scripts, re-running a workflow that already returned 
\`COMPLETED\`.
   - \`MachineUDF\`, the relevant Sklearn operators 
(\`SklearnTrainingLinearRegression\`, \`SklearnTrainingRidge\`, \`SVRTrainer\`, 
\`SklearnPrediction\`, \`Split\`), and \`PythonTableReducer\` are added to 
\`DEFAULT_AGENT_SETTINGS.allowedOperatorTypes\` so the agent can actually pick 
them.
   
   A worked example for the diabetes regression demo is embedded in the prompt 
verbatim, including a per-branch script template.
   
   ## How it was tested
   
   End-to-end demo, repeatedly: agent prompt → \`runOnMachine\` to verify the 
local CSV → \`uploadFileToDataset\` → \`addOperator\` \`CSVFileScan\` + batch 
\`MachineUDF\` → workflow execution. Result table populated with 3 rows 
(LinearRegression / Ridge / SVR metrics), 3 PNGs + \`report.md\` written to the 
laptop. Example output:
   
   | Model | R² | MSE |
   |---|---|---|
   | LinearRegression | 0.4753 | 3375.74 |
   | Ridge | 0.4838 | 3320.47 |
   | SVR | -0.2073 | 7766.64 |
   
   Failure modes we encountered and have either fixed or surfaced clearly:
   - JDK \`HttpClient\` h2c upgrade dropping the POST body → pin HTTP/1.1.
   - Whole-table \`MachineUDF\` scripts running in per-tuple mode because the 
agent forgot \`batchMode: true\` → loud prompt rule + clear malformed-output 
signature.
   - LLM-generated Python scripts with raw newlines inside f-strings → 
server-side \`compile()\` pre-flight + explicit prompt rule.
   - LLM hallucinating \`http://<machine-name>:5555\` instead of using the URL 
returned by \`runOnMachine\` → hard prompt rule.
   
   ## Known limitations / things to discuss
   
   - The Apache license header is missing on the new \`machine-manager/\` 
Python files. Easy to add in a cleanup pass.
   - \`machine-manager\` has no rate limiting / process isolation beyond what 
FastAPI gives you. The \`/exec\` endpoint is a literal "run this bash command 
on the user's host" — useful for inspection from the canvas, but the threat 
model has to be "the user already trusts this Texera instance with their 
laptop".
   - \`MachineUDF\` operator has only one input port. Three-branch ML demos 
that want to fan three SklearnPrediction outputs into one collector currently 
need a separate \`MachineUDF\` per branch (or to standardize column names and 
Union upstream). A multi-port variant would be a nice follow-up.
   - The frontend changes include some sibling work (ML-model component, MLFlow 
operator schema) that would need to be split out for a real merge.
   - The agent-service prompt is currently authored for a specific demo flow 
(regression on diabetes.csv). A real merge would generalize the worked example.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

