Xiao-zhen-Liu opened a new pull request, #5115:
URL: https://github.com/apache/texera/pull/5115

   # PR Description: AI-Augmented Macro Operators
   
   ## What changes were proposed in this PR?
   
   Introduces **macro operators** — a logical-plan-level abstraction for 
encapsulating reusable sub-DAGs as single nodes on the canvas — together with 
the AI surfaces that make them practical: a suggestion panel that scans the 
current workflow for refactor opportunities, a one-click "fuse for performance" 
path that collapses a macro body into a single Python UDF, and a drill-down 
editor for inspecting / editing a macro body while live execution stats keep 
flowing.
   
   ### Surface
   
   * **Right-click → "Create macro"** swaps a selected sub-DAG with a single 
`Macro` op on the parent canvas. The body is persisted as a separate 
`MACRO`-kind workflow row; references the persisted body via `(macroId, 
macroVersion)` in `LIVE` link mode (`SNAPSHOT` mode also supported for portable 
bodies).
   * **"Your Macros" palette section** lists every macro the user has saved; 
click → instantiate, hover → preview, right-click → export/import as portable 
JSON bundles (transitive — nested macros travel with their parent).
   * **"Suggest Macros (AI)" panel** (`MacroSuggestionService`) ranks two 
heuristic candidate sources side-by-side: linear chains, and recurring 
`(opType₁, opType₂, …)` patterns across the workflow. Recurring patterns 
auto-tier as `✓ recommended` because duplicated logic is the strongest "extract 
as macro" signal. Names are domain-aware (`csv_preprocessing`, 
`text_filtering`, `metric_summary`, `joined_enrichment`, `ml_train_eval`, …) 
rather than underscore-joined op types.
   * **Right-click → "fuse for performance (AI)"** (`MacroFusionService`) emits 
a `PythonUDFOpDescV2`-compatible `process_tuple` function from the macro body. 
Covers Filter, Projection, Regex, Limit, Distinct, and inlined PythonUDFV2 
(yield-rewritten). Marks `fusion.verified = true` so `MacroExpander` 
substitutes a single UDF for the inlined body at compile time. Visual: solid 
gold stroke + `⚡ FUSED · 1.6×` badge on canvas. Speedup is grounded in the 
handoff-removal model (N−1 internal actor boundaries collapsed; conservative 
×0.30 per handoff, capped at ×4).
   * **Drill-down editor** — double-clicking a macro op routes to 
`/dashboard/user/workflow/{wid}/macro/{macroId}?instance=…` and renders the 
body on a child canvas. The body is laid out with dagre using the same settings 
as the main canvas's auto-layout. Live execution stats (incl. row counts per 
port) flow into the drill-down view via a `(body-op-id → runtime-UUID)` map 
sourced from `MacroService`'s runtime-mapping cache.
   
   ### Architecture
   
   * **Macros live at the logical-plan layer only.** `MacroExpander` (mirrored 
in `amber/` and `workflow-compiling-service/`) runs a pre-compile pass that 
inlines every `MacroOpDesc` into its body operators and rewrites edges, so the 
physical-plan layer never sees a macro. `WorkflowCompiler` calls 
`MacroExpander.expand` before `expandLogicalPlan`.
   * **Deterministic UUIDs for inner ops.** The expander assigns each inner op 
a fresh ID via 
`UUID.nameUUIDFromBytes("${macroInstanceId}|${originalBodyOpId}")`. Required 
because (a) the long `${instanceId}--${innerOp}` prefix scheme produced 170+ 
char IDs that caused Iceberg commit thrash on HashJoin's internal build-side 
port, and (b) Texera has two compilers — frontend validation and execution-time 
— that must produce bit-identical plans, otherwise the macro-mapping side-table 
written by one wouldn't match the runtime stats emitted by the other.
   * **Provenance side-table.** `MacroExpander` populates a `Map[runtimeOpId → 
MacroProvenance(macroChain, bodyOpId)]` during expansion; `WorkflowCompiler` 
drains it after compile and stores it in `MacroMappingCache` (file-backed at 
`/tmp/texera-macro-mappings/wid-{wid}.json` for cross-JVM visibility between 
`ComputingUnitMaster` and `TexeraWebApplication`). Exposed via `GET 
/api/workflow/{wid}/macro-mapping`. Frontend 
`WorkflowStatusService.withMacroAggregates` walks the chain to roll inner-op 
stats up to every macro level (parent canvas + nested drill-downs).
   * **Nested macros are fully supported** — the chain stored per runtime op is 
`[outerInstanceId, innerInstanceId, …]`; the resolver suffix-matches the chain 
so a stats-roll-up rooted at an inner macro still finds its runtime ops.
   
   ### What this PR also fixes (along the way)
   
   * **View-result inside a macro** — drill-down result lookups now go 
body-relative-id → runtime-UUID via 
`MacroService.buildBodyOpIdToRuntimeUuidMap` (replaces the obsolete 
prefix-based alias). Mega-macros with 0 external outputs alias the canvas op to 
the first body sink, so the auto-stored terminal output is reachable without 
drilling.
   * **Back-to-parent stats** — `WorkflowStatusService` re-aggregates the 
cached raw status on every `runtimeMacroMappingTick`, and its emission Subject 
becomes `ReplaySubject(1)` so the canvas remount after navigation gets the 
latest snapshot immediately.
   * **Jackson `macroSyncedAt` UnrecognizedPropertyException** at execute time 
— `MacroOpDesc` annotated with `@JsonIgnoreProperties(ignoreUnknown = true)` so 
UI-only fields the frontend stamps onto operatorProperties don't break 
deserialization.
   * **`/api/macro/*` HTTP storm** — lazy fetches inside template bindings 
caused an infinite loop; reverted to a flat palette and removed the lazy 
fetches.
   * **Engine error visibility** — phase-transition errors and 
missing-schema-port errors now propagate out of `RegionExecutionCoordinator` 
instead of stalling silently.
   
   ## Any related issues, documentation, discussions?
   
   Hackathon submission. Builds on §9.2 of the macro design doc (AI fusion 
substitution path).
   
   ## How was this PR tested?
   
   * **`MacroExpanderSpec`** (~694 lines) covers the expander on its own: 
single-macro expansion, nested expansion (outer chain + inner chain), input 
fan-out (single external port → multiple inner consumers), output fan-in 
detection (raises), cycle detection across nested macros, depth-limit guard, 
deterministic-UUID property (same input → same output across compiler 
instances), and the provenance side-table population.
   * **`MacroOpDescSpec`** covers Jackson serialization round-trip (incl. 
tolerance of unknown frontend-only fields like `macroSyncedAt`).
   * **Demo runbook** — exercised the full path end-to-end on a real 
multi-macro workflow including nested macros, view-result on inner sinks, fuse 
+ unfuse, drill-down navigation in/out, and a "mega-macro" (entire workflow 
wrapped). Stats roll up correctly at every nesting level and the canvas remount 
after navigation no longer wipes non-macro op states.
   
   ```
   sbt "WorkflowExecutionService/testOnly *MacroExpanderSpec"
   sbt "WorkflowOperator/testOnly *MacroOpDescSpec"
   ```
   
   ## Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.7)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to