Re: [D] Design and merge plan: operator output port result cache (MVP) [texera]

via GitHub Sun, 21 Jun 2026 22:29:48 -0700


GitHub user Xiao-zhen-Liu edited a discussion: Design and merge plan: operator 
output port result cache (MVP)


# Design and merge plan: operator output port result cache (MVP)

In current Texera main, the engine runs a workflow from the start every time, 
even when the user changed only one operator near the end. This proposal adds a 
result cache so that, on a re-run, an output port whose upstream computation 
logic is unchanged reads its saved result instead of recomputing it. The code 
is written and working on a 
[prototype branch](https://github.com/Xiao-zhen-Liu/texera/tree/xiaozhen-caching-prototype). 
This post describes the design and the plan to bring it into `main` as small 
PRs, so anyone can raise concerns before the PRs go up.

## Matching results across executions

Each output port has a **cache key** built from its upstream operators, their 
parameters, their output schemas, and the wiring between them. Two ports with 
the same cache key produce the same result (output port equivalence). When a 
run saves a port's result, we record `(workflow, port, cache key) -> result 
location`. On a later run, a port whose cache key has a recorded result is a 
**matched port**, and that result can be reused. Any edit upstream of a port 
changes its cache key, so its old result is no longer matched.
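
A minimal sketch of the idea, not the prototype's code: it assumes one output port per operator, JSON-serializable parameters, and a SHA-256 hash over the upstream subgraph. `Operator`, `cache_key`, and `CacheRegistry` are hypothetical names used only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical stand-in for an operator (one output port assumed)."""
    op_id: str
    op_type: str
    params: tuple         # (name, value) pairs
    output_schema: tuple  # (column, type) pairs


def cache_key(port_op: str, operators: dict, edges: list) -> str:
    """Hash the operator owning the port plus everything upstream of it."""
    # Walk edges backward to collect port_op and all of its ancestors.
    upstream, stack = set(), [port_op]
    while stack:
        op = stack.pop()
        if op not in upstream:
            upstream.add(op)
            stack.extend(src for src, dst in edges if dst == op)
    payload = {
        "ops": sorted(
            [o.op_id, o.op_type, list(map(list, o.params)),
             list(map(list, o.output_schema))]
            for o in (operators[i] for i in upstream)
        ),
        # Wiring: only edges that lie inside the upstream subgraph.
        "edges": sorted([s, d] for s, d in edges
                        if s in upstream and d in upstream),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


class CacheRegistry:
    """In-memory (workflow, port, cache key) -> result location map."""

    def __init__(self):
        self._locations = {}

    def record(self, workflow_id, port, key, location):
        self._locations[(workflow_id, port, key)] = location

    def lookup(self, workflow_id, port, key):
        # Returns None when the port is not a matched port.
        return self._locations.get((workflow_id, port, key))


ops = {
    "scan": Operator("scan", "CSVScan", (("path", "data.csv"),), (("x", "int"),)),
    "filter": Operator("filter", "Filter", (("predicate", "x > 1"),), (("x", "int"),)),
}
edges = [("scan", "filter")]
registry = CacheRegistry()
for op_id in ops:
    registry.record("wf-1", op_id, cache_key(op_id, ops, edges), f"results/{op_id}")

# Edit the filter: the scan port still matches, the filter port does not.
edited = dict(ops, filter=Operator("filter", "Filter",
                                   (("predicate", "x > 5"),), (("x", "int"),)))
matched = {
    op_id: registry.lookup("wf-1", op_id, cache_key(op_id, edited, edges))
    for op_id in edited
}
```

In this sketch the edit to `filter` leaves the key of `scan` untouched, because `scan` has no upstream operators that changed, while the key of `filter` changes and its old entry is no longer found.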

## Scope (MVP)

In scope: reuse the saved result at every matched port (**full reuse**), match 
by cache key, invalidate entries that no longer match after an edit, and read, 
write, and clear the cache from the UI.
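
One way invalidation could look, as a sketch under assumptions: entries are held in a plain dictionary keyed by `(workflow, port, cache key)`, and an entry is stale when the port's current cache key differs or the port no longer exists. The function name `invalidate_stale` and the port and location strings are hypothetical.

```python
def invalidate_stale(entries: dict, workflow_id: str, current_keys: dict) -> list:
    """Drop entries of workflow_id whose cache key no longer matches.

    entries maps (workflow_id, port, cache_key) -> result location.
    current_keys maps port -> cache key computed from the edited workflow.
    Returns the invalidated result locations so their storage can be freed.
    """
    stale = [
        k for k in entries
        # A deleted port has no current key, so .get() yields None and it is stale.
        if k[0] == workflow_id and current_keys.get(k[1]) != k[2]
    ]
    return [entries.pop(k) for k in stale]


entries = {
    ("wf-1", "scan.out", "k1"): "results/a",
    ("wf-1", "filter.out", "k2"): "results/b",
    ("wf-2", "scan.out", "k9"): "results/c",
}
# After an edit to the filter in wf-1, only its port has a new key.
removed = invalidate_stale(entries, "wf-1",
                           {"scan.out": "k1", "filter.out": "k2-edited"})
```

Entries that belong to other workflows are left alone, since matching is scoped per workflow.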

Out of scope (future work, not in these PRs): choosing per port whether reuse 
is cheaper than recompute (cost-based reuse planning), and removing results 
under storage limits (eviction). The merged code always reuses a matched port's 
result.

## How it fits the current system

Current main (figure below): the Workflow Compiler builds a physical plan, 
`CostBasedScheduleGenerator` builds a schedule of regions, and the executor 
runs the regions on workers that read and write tables in storage.

<img width="1376" height="714" alt="op-port-cache-related-diagrams-0 execution components drawio" src="https://github.com/user-attachments/assets/ad5e9712-c213-4ab9-974f-029c08648964" />



With the cache MVP (figure below), the engine adds a cache service and a skeleton generator, and the existing scheduler receives a smaller plan:

- **Cache service**: at submission, find the matched ports for this workflow 
(cache-key lookup); during execution, record cache metadata.
- **Skeleton generator**: remove the operators and edges whose results are 
reused, leaving the **run-skeleton**, the part that still needs to run.
- **Scheduler** (`CostBasedScheduleGenerator`): schedule the run-skeleton as it 
does today. The removed part becomes regions that are skipped, and operators 
that read from it use the saved result locations.

The executor saves results to the cached-result storage as ports finish.
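
The skeleton step can be sketched as a backward walk from the terminal operators. This is an illustration under assumptions, not the prototype's algorithm: it treats each operator as having a single output port, ignores regions, and uses the hypothetical function name `run_skeleton`.

```python
from collections import deque


def run_skeleton(operators, edges, matched):
    """Split a plan into the part to run and the saved results it reads.

    operators: iterable of operator ids.
    edges: list of (upstream, downstream) pairs.
    matched: set of operator ids whose output port has a saved result.
    Returns (to_run, kept_edges, reused).
    """
    operators = list(operators)
    producers = {op: [] for op in operators}
    has_consumer = set()
    for src, dst in edges:
        producers[dst].append(src)
        has_consumer.add(src)

    to_run, reused = set(), set()
    # Start from terminal operators, whose results the user wants.
    queue = deque(op for op in operators if op not in has_consumer)
    while queue:
        op = queue.popleft()
        if op in to_run or op in reused:
            continue
        if op in matched:
            reused.add(op)  # read the saved result; do not explore upstream
        else:
            to_run.add(op)
            queue.extend(producers[op])
    # Edges fully inside the skeleton; edges from reused ops become storage reads.
    kept_edges = [(s, d) for s, d in edges if s in to_run and d in to_run]
    return to_run, kept_edges, reused


ops = ["scan", "filter", "agg", "viz"]
edges = [("scan", "filter"), ("filter", "agg"), ("agg", "viz")]
# The user edited "agg", so only the ports upstream of it still match.
to_run, kept_edges, reused = run_skeleton(ops, edges, matched={"scan", "filter"})
```

Here `agg` and `viz` form the run-skeleton, `agg` reads the saved result of `filter`, and `scan` is neither run nor read. With an empty `matched` set the function returns the whole plan unchanged.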


<img width="1991" height="1599" alt="op-port-cache-related-diagrams-0 execution components (cache) drawio" src="https://github.com/user-attachments/assets/0825ef5e-ed91-4cfc-9bd4-e0d00a5f7c0c" />



## Nothing changes when there are no matched ports

On the first run of a workflow, or on a run right after an edit that is upstream 
of every saved result, there are no matched ports: the run-skeleton is the whole 
plan and the schedule is the same as today. The cache changes behavior only when 
a matched port exists, so the code can land inactive and turn on once results 
are saved.

## Merge plan: five PRs

The PRs are listed in dependency order, and each has its own issue:

1. **Storage foundation** (#5882): the cache table, the code that reads and 
writes it, and the cache-key computation.
2. **Cache state and statistics** (#5883): a "completed from cache" operator 
state and the statistics handling that goes with it.
3. **Scheduler** (#5884): the reuse planner (full reuse), skeleton generator, 
and the change that schedules the run-skeleton and combines it with the skipped 
regions.
4. **Turn the feature on** (#5885): the submission-time cache-matcher lookup, 
saving results as ports finish, the cache endpoints, and cleanup on deletion.
5. **Frontend** (#5886): the cache panel and the canvas display.

PRs 1 and 2 are independent. PR 3 needs 1 and 2; PR 4 needs 3; PR 5 needs 4.



GitHub link: https://github.com/apache/texera/discussions/5880
