GitHub user Xiao-zhen-Liu edited a discussion: Design and merge plan: operator output port result cache (MVP)
# Design and merge plan: operator output port result cache (MVP)

In current Texera main, the engine runs a workflow from the start every time, even when the user changed only one operator near the end. This proposal adds a result cache so that, on a re-run, an output port whose upstream computation logic is unchanged reads its saved result instead of recomputing it.

The code is written and working on a [prototype branch](https://github.com/Xiao-zhen-Liu/texera/tree/xiaozhen-caching-prototype). This post describes the design and the plan to bring it into `main` as small PRs, so anyone can raise concerns before the PRs go up.

## Matching results across executions

Each output port has a **cache key** built from its upstream operators, their parameters, their output schemas, and the wiring between them. Two ports with the same cache key produce the same result (output port equivalence).

When a run saves a port's result, we record `(workflow, port, cache key) -> result location`. On a later run, a port whose cache key has a recorded result is a **matched port**, and that result can be reused. Any edit upstream of a port changes its cache key, so its old result is no longer matched.

## Scope (MVP)

In scope: reuse the saved result at every matched port (**full reuse**), match by cache key, invalidate entries that no longer match after an edit, and read, write, and clear the cache from the UI.

Out of scope (future work, not in these PRs): choosing per port whether reuse is cheaper than recompute (cost-based reuse planning), and removing results under storage limits (eviction). The merged code always reuses a matched port's result.

## How it fits the current system

Current main (figure below): the Workflow Compiler builds a physical plan, `CostBasedScheduleGenerator` builds a schedule of regions, and the executor runs the regions on workers that read and write tables in storage.

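
To make the matching rule from "Matching results across executions" concrete, here is a minimal Python sketch. It is not the prototype's code: the `Operator` class, the `cache_key` function, the `CacheIndex` class, the SHA-256-over-JSON hashing scheme, and the example operator names and result locations are all assumptions made for illustration. It only shows how a key that folds in upstream keys lets an unchanged port match while an edited one does not.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Operator:
    """Hypothetical operator description (not Texera's real class)."""
    op_type: str
    params: dict
    output_schema: list  # e.g. [["amount", "int"]]


def cache_key(ops, edges, op_id, port):
    """Hash an output port together with everything upstream of it.

    `edges` is a list of (src_op, src_port, dst_op, dst_port) tuples.
    The hashing scheme is an assumption chosen for this sketch.
    """
    # Keys of the upstream ports feeding this operator, sorted so the
    # result does not depend on the order of `edges`.
    inputs = sorted(
        (dst_port, cache_key(ops, edges, src, src_port))
        for (src, src_port, dst, dst_port) in edges
        if dst == op_id
    )
    op = ops[op_id]
    payload = json.dumps(
        {
            "type": op.op_type,
            "params": op.params,
            "schema": op.output_schema,
            "port": port,
            "inputs": inputs,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CacheIndex:
    """In-memory stand-in for (workflow, port, cache key) -> result location."""

    def __init__(self):
        self._entries = {}

    def record(self, workflow_id, port_id, key, location):
        self._entries[(workflow_id, port_id, key)] = location

    def lookup(self, workflow_id, port_id, key):
        # Returns None when the port is not a matched port.
        return self._entries.get((workflow_id, port_id, key))


# First run: save a result location for every output port.
ops = {
    "scan": Operator("CSVScan", {"file": "sales.csv"}, [["amount", "int"]]),
    "filter": Operator("Filter", {"predicate": "amount > 10"}, [["amount", "int"]]),
}
edges = [("scan", 0, "filter", 0)]
index = CacheIndex()
for op_id in ops:
    key = cache_key(ops, edges, op_id, 0)
    index.record("wf-1", (op_id, 0), key, f"results/{op_id}/0")

# The user edits the filter; only ports at or below the edit lose their match.
ops["filter"].params["predicate"] = "amount > 20"
matched = [
    op_id
    for op_id in ops
    if index.lookup("wf-1", (op_id, 0), cache_key(ops, edges, op_id, 0))
]
```

In this sketch the scan's port stays matched after the edit because nothing upstream of it changed, while the filter's port gets a new key. Folding each input's key into the downstream key is what makes an upstream edit change every key below it.
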
<img width="1376" height="714" alt="op-port-cache-related-diagrams-0 execution components drawio" src="https://github.com/user-attachments/assets/ad5e9712-c213-4ab9-974f-029c08648964" />

With the cache MVP (figure below), the engine includes additional modules:

- **Cache service**: at submission, find the matched ports for this workflow (cache-key lookup); during execution, record cache metadata.
- **Skeleton generator**: remove the operators and edges whose results are reused, leaving the **run-skeleton**, the part that still needs to run.
- **Scheduler** (`CostBasedScheduleGenerator`): schedule the run-skeleton as it does today. The removed part becomes regions that are skipped, and operators that read from it use the saved result locations.

The executor saves results to the cached-result storage as ports finish.

<img width="1991" height="1599" alt="op-port-cache-related-diagrams-0 execution components (cache) drawio" src="https://github.com/user-attachments/assets/0825ef5e-ed91-4cfc-9bd4-e0d00a5f7c0c" />

## Nothing changes when there are no matched ports

On the first run of a workflow, or on a run right after an edit upstream of every saved port, there are no matched ports: the run-skeleton is the whole plan and the schedule is the same as today. The cache changes behavior only when a matched port exists, so the code can land inactive and turn on once results are saved.

## Merge plan: five PRs

In dependency order; each has its own issue:

1. **Storage foundation** (#5882): the cache table, the code that reads and writes it, and the cache-key computation.
2. **Cache state and statistics** (#5883): a "completed from cache" operator state and the corresponding statistics handling.
3. **Scheduler** (#5884): the reuse planner (full reuse), the skeleton generator, and the change that schedules the run-skeleton and combines it with the skipped regions.
4. **Turn the feature on** (#5885): the submission-time cache-matcher lookup, saving results as ports finish, the cache endpoints, and cleanup on deletion.
5. **Frontend** (#5886): the cache panel and the canvas display.

PRs 1 and 2 are independent. PR 3 needs 1 and 2; PR 4 needs 3; PR 5 needs 4.

GitHub link: https://github.com/apache/texera/discussions/5880
