GitHub user Xiao-zhen-Liu created a discussion: Cache Key Design for One-to-Many Logical-Physical Operator Mapping
## Background: Current Design

Currently, there is a one-to-one mapping from a workflow's logical plan to its physical plan. Each logical operator maps to one or more physical operators with a fixed implementation choice. For example, `Sort` always maps to the Python-based sort implementation, and `HashJoin` always maps to `Build` + `Probe`. This has two limitations:

1. **Logical operators expose physical choices.** The logical plan uses names like `HashJoin` instead of `Join`, leaking the implementation into the user-facing layer. Ideally, logical operator names should stay logical.
2. **No physical-plan alternatives.** A given logical operator always produces the same physical operator(s). In principle, `Join` could be realized as `HashJoin`, `SortMergeJoin`, or other implementations, and a cost-based optimizer could choose among them.

## The Cache-Key Tradeoff Under One-to-Many Mapping

If we move to a one-to-many logical-to-physical mapping, the cache-key design must decide at which level of the plan hierarchy to match cached results. There are two options, each with a tradeoff.

**Option A: Cache key on the logical plan.** A cached result is associated with a logical operator's output port and its upstream logical sub-DAG. Matching uses logical-plan equality: if the logical sub-DAGs are the same, the cached result can be reused regardless of the physical implementation.

- *Advantage:* Maximum reuse across physical-plan choices. For example, if Execution 1 realizes `Join` as `HashJoin` and caches the join output, Execution 2 can reuse that cached result even if the optimizer would otherwise choose `SortMergeJoin`. The cache effectively preempts the physical-plan decision for that operator.
- *Disadvantage:* Cannot exploit sub-operator granularity.
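Option A's matching can be sketched as a structural hash over the upstream logical sub-DAG. This is an illustrative Python sketch, not Texera's actual API; `LogicalOp` and `logical_cache_key` are hypothetical names. The point is that nothing about the chosen physical implementation enters the key.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalOp:
    """A logical operator: a logical name, a configuration, and its inputs."""
    name: str           # logical name only, e.g. "Join" -- never "HashJoin"
    config: tuple       # operator configuration, e.g. the join condition
    inputs: tuple = ()  # upstream LogicalOp instances

def logical_cache_key(op: LogicalOp, port: int = 0) -> str:
    """Cache key for one output port of `op`: a structural hash of the
    upstream logical sub-DAG. Physical implementation choices never enter
    the hash, so a result cached by a run that used HashJoin still matches
    a later plan that would pick SortMergeJoin."""
    upstream = tuple(logical_cache_key(child) for child in op.inputs)
    payload = repr((op.name, op.config, upstream, port))
    return hashlib.sha256(payload.encode()).hexdigest()

# Two executions that build the same logical plan derive the same key,
# regardless of which join implementation the optimizer later chooses.
scan_a = LogicalOp("Scan", ("table_a",))
scan_b = LogicalOp("Scan", ("table_b",))
join = LogicalOp("Join", ("a.id = b.id",), (scan_a, scan_b))
```

Equal logical sub-DAGs yield equal keys; different configurations or different output ports yield different keys.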
A `HashJoin` produces intermediate state (e.g., the build-side hash table) that could be cached and reused by another `HashJoin`, but a logical-level cache key cannot express this because `Build` is not visible at the logical level.

**Option B: Cache key on the physical plan.** A cached result is associated with a physical operator's output port and its upstream physical sub-DAG (this is the current design).

- *Advantage:* Fine-grained reuse of intermediate physical results (e.g., a hash table from `Build`, a sorted run from `Sort`, a local aggregate from `LocalAgg`).
- *Disadvantage:* No cross-implementation reuse. A `HashJoin`'s output cannot match a `SortMergeJoin`'s output, even though they compute the same logical result. If the optimizer changes the physical plan between executions, all downstream cached results are invalidated.

**Correctness requirement:** Logical-level cache reuse assumes that all physical implementations of a logical operator produce equivalent output (at least bag-equal). If a downstream operator depends on output properties specific to one implementation (e.g., sorted order from `SortMergeJoin`), the optimizer must account for this when deciding whether to use a logical-level cached result.

## Proposed Extension: Cache Keys at Both Levels

Under a one-to-many mapping, we can support cache keys at both the logical and physical levels to combine the benefits:

1. **Compute cache keys at both levels.** Each logical-plan output port gets a logical cache key (based on the upstream logical sub-DAG and operator configurations), and each physical-plan output port gets a physical cache key (based on the upstream physical sub-DAG), as in the current design.
2. **Pass logical-level matches to the physical-plan optimizer.** Before physical-plan optimization, the system checks which logical output ports have cached results.
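The pre-optimization probe for logical-level matches can be sketched as follows. This is a hypothetical illustration (the names `find_logical_matches`, the key strings, and the result paths are invented), assuming the cache is a simple map from cache key to a handle for the materialized result.

```python
def find_logical_matches(logical_keys, cache):
    """Before physical-plan optimization, probe the result cache with the
    logical cache key of every logical output port.

    logical_keys: {(operator, port): cache_key} for the logical plan.
    cache: cache_key -> handle to a materialized result.
    Returns the logical output ports whose results are already cached."""
    return {port: cache[key] for port, key in logical_keys.items() if key in cache}

# Execution 1 cached the join output under its logical key.
cache = {"k-join-o1": "/results/exec1/join_o1"}
logical_keys = {("Join", "o1"): "k-join-o1", ("Sort", "o1"): "k-sort-o1"}
matches = find_logical_matches(logical_keys, cache)
# The optimizer can now treat ("Join", "o1") as already materialized and
# skip choosing between HashJoin and SortMergeJoin for that operator.
```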
These logical-level matches are passed to the physical-plan optimizer as additional input, since a logical-level match may make certain physical-plan choices unnecessary or change their relative cost.

3. **The optimizer considers both sources of reuse.** Each logical-plan output port maps uniquely to a physical-plan output port, regardless of which physical plan is chosen (see the figure below). The optimizer can therefore account for cache matches from both levels. A logical-level match provides the final output for free; a physical-level match provides a reusable intermediate result within the chosen implementation.

In the figure above, the logical `Join` operator has output port `o1`. In Physical Plan 1 (HashJoin), `Join.o1` maps to `Probe.o1`. In Physical Plan 2 (SortMergeJoin), `Join.o1` maps to `Merge.o1`. A logical cache key for `Join.o1` matches in both physical plans via this port mapping. Physical cache keys, on the other hand, do not match across implementations: `Probe.o1 ≠ Merge.o1`. Meanwhile, `Build.o1` is a physical-only port with no logical counterpart, enabling finer-grained reuse within the HashJoin implementation.

## Relation to the Current One-to-One Design

Under the current one-to-one mapping, the physical plan has strictly *more* cacheable locations than the logical plan. Physical operators that are internal to a logical operator (such as `Build` within `HashJoin`) have output ports with their own cache keys, but these ports have no logical-plan counterpart. So even under one-to-one mapping, physical-plan cache keys provide finer-grained reuse than logical-plan cache keys alone.

The proposed extension adds logical-level cache keys on top of the existing physical-level keys. This only becomes useful under one-to-many logical-to-physical mapping, where the cross-implementation reuse benefit of logical keys comes into play.

## Note on Reuse Policy

When a cached result matches (at either level), the system always reuses it rather than recomputing.
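The two-level lookup and the port mapping described above can be sketched together. This is a hypothetical illustration, assuming each candidate physical plan carries a map from logical output ports to physical output ports; `port_map`, `lookup`, and the key strings are invented names, not Texera's implementation.

```python
# Execution 1 ran the HashJoin plan and cached Join's output under both
# its logical key and its physical key.
cache = {
    "logical:Join.o1": "/results/exec1/join",    # logical-level entry
    "physical:Probe.o1": "/results/exec1/join",  # physical-level entry
}

# Port mapping from the figure: the same logical port maps to a different
# physical port in each candidate physical plan.
port_map = {
    "hash_join_plan": {"Join.o1": "Probe.o1"},
    "sort_merge_plan": {"Join.o1": "Merge.o1"},
}

def lookup(logical_port, physical_plan):
    """Two-level lookup: a hit at either level is reused unconditionally
    (the current policy). The logical key is probed first because it
    survives changes to the physical plan."""
    physical_port = port_map[physical_plan][logical_port]
    for key in (f"logical:{logical_port}", f"physical:{physical_port}"):
        if key in cache:
            return cache[key]
    return None

# Execution 2 picks the SortMergeJoin plan: the physical key
# "physical:Merge.o1" misses, but the logical key still hits.
```

If only the physical-level entry existed, the HashJoin plan would still hit via `physical:Probe.o1`, while the SortMergeJoin plan would miss entirely, which is exactly the Option B limitation described above.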
Always reusing a cached match is a simplifying assumption; there are workloads where recomputation would be cheaper (e.g., when a cached result is large and expensive to read from storage). A cost-based reuse decision is an interesting direction but is outside the scope of this design.

GitHub link: https://github.com/apache/texera/discussions/4346

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
