GitHub user Xiao-zhen-Liu created a discussion: Cache Key Design for 
One-to-Many Logical-Physical Operator Mapping

## Background: Current Design

Currently, there is a one-to-one mapping from a workflow's logical plan to its 
physical plan. Each logical operator maps to one or more physical operators 
with a fixed implementation choice. For example, `Sort` always maps to the 
Python-based sort implementation, and `HashJoin` always maps to `Build` + 
`Probe`.

This has two limitations:

1. **Logical operators expose physical choices.** The logical plan uses names 
like `HashJoin` instead of `Join`, leaking the implementation into the 
user-facing layer. Ideally, logical operator names should stay logical.
2. **No physical-plan alternatives.** A given logical operator always produces 
the same physical operator(s). In principle, `Join` could be realized as 
`HashJoin`, `SortMergeJoin`, or other implementations, and a cost-based 
optimizer could choose among them.

## The Cache-Key Tradeoff Under One-to-Many Mapping

If we move to a one-to-many logical-to-physical mapping, the cache-key design 
must decide at which level of the plan hierarchy to match cached results. There 
are two options, each with a tradeoff.

**Option A: Cache key on the logical plan.**
A cached result is associated with a logical operator's output port and its 
upstream logical sub-DAG. Matching uses logical-plan equality: if the logical 
sub-DAGs are the same, the cached result can be reused regardless of the 
physical implementation.

- *Advantage:* Maximum reuse across physical-plan choices. For example, if 
Execution 1 realizes `Join` as `HashJoin` and caches the join output, Execution 
2 can reuse that cached result even if the optimizer would otherwise choose 
`SortMergeJoin`. The cache effectively preempts the physical-plan decision for 
that operator.
- *Disadvantage:* Cannot exploit sub-operator granularity. A `HashJoin` 
produces intermediate state (e.g., the build-side hash table) that could be 
cached and reused by another `HashJoin`, but a logical-level cache key cannot 
express this because `Build` is not visible at the logical level.

**Option B: Cache key on the physical plan.**
A cached result is associated with a physical operator's output port and its 
upstream physical sub-DAG (this is the current design).

- *Advantage:* Fine-grained reuse of intermediate physical results (e.g., a 
hash table from `Build`, a sorted run from `Sort`, a local aggregate from 
`LocalAgg`).
- *Disadvantage:* No cross-implementation reuse. A `HashJoin`'s output cannot 
match a `SortMergeJoin`'s output, even though they compute the same logical 
result. If the optimizer changes the physical plan between executions, all 
downstream cached results are invalidated.

**Correctness requirement:** Logical-level cache reuse assumes that all 
physical implementations of a logical operator produce equivalent output (at 
least bag-equal). If a downstream operator depends on output properties 
specific to one implementation (e.g., sorted order from `SortMergeJoin`), the 
optimizer must account for this when deciding whether to use a logical-level 
cached result.
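To make the tradeoff concrete, here is a minimal sketch of hierarchical cache-key computation. It assumes a key for an operator's output is a hash over the operator's name, its configuration, and the keys of its upstream inputs (i.e., the upstream sub-DAG); the function name `cache_key` and the tuple encoding are illustrative, not Texera's actual API.

```python
import hashlib

def cache_key(op_name: str, config: tuple, upstream_keys: list) -> str:
    # A key covers the whole upstream sub-DAG: the operator itself plus
    # the keys of its inputs (input order matters, e.g. join sides).
    payload = repr((op_name, config, upstream_keys))
    return hashlib.sha256(payload.encode()).hexdigest()

# Logical plan: Scan(A) -> Join <- Scan(B)
scan_a = cache_key("Scan", ("tableA",), [])
scan_b = cache_key("Scan", ("tableB",), [])
logical_join = cache_key("Join", ("a.id = b.id",), [scan_a, scan_b])

# Physical plan 1: HashJoin realized as Build + Probe
build = cache_key("Build", ("a.id",), [scan_a])
probe = cache_key("Probe", ("b.id",), [build, scan_b])

# Physical plan 2: SortMergeJoin realized as Sort + Sort + Merge
sort_a = cache_key("Sort", ("a.id",), [scan_a])
sort_b = cache_key("Sort", ("b.id",), [scan_b])
merge = cache_key("Merge", ("a.id = b.id",), [sort_a, sort_b])

# The logical key for the join output is the same no matter which
# implementation is chosen, while the physical keys differ (Option B's
# disadvantage): a Probe output can never match a Merge output.
assert probe != merge
assert logical_join == cache_key("Join", ("a.id = b.id",), [scan_a, scan_b])
```

Note that `build` has a key of its own with no logical counterpart, which is exactly the sub-operator granularity that only physical-level keys can express.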

## Proposed Extension: Cache Keys at Both Levels

Under a one-to-many mapping, we can support cache keys at both the logical and 
physical levels to combine the benefits of both options:

1. **Compute cache keys at both levels.** Each logical-plan output port gets a 
logical cache key (based on the upstream logical sub-DAG and operator 
configurations), and each physical-plan output port gets a physical cache key 
(based on the upstream physical sub-DAG), as in the current design.

2. **Pass logical-level matches to the physical-plan optimizer.** Before 
physical-plan optimization, the system checks which logical output ports have 
cached results. These matches are passed to the physical-plan optimizer as 
additional input, since a logical-level match may make certain physical-plan 
choices unnecessary or change their relative cost.

3. **The optimizer considers both sources of reuse.** Each logical-plan output 
port maps uniquely to a physical-plan output port, regardless of which physical 
plan is chosen (see the figure below). The optimizer can therefore account for 
cache matches from both levels. A logical-level match provides the final output 
for free; a physical-level match provides a reusable intermediate result within 
the chosen implementation.
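The three steps above can be sketched as a two-level lookup. This is a hypothetical illustration, not Texera's implementation: the cache dictionary, the `L:`/`P:` key prefixes, and the function names are all assumptions.

```python
cache: dict = {}  # cache key -> location of the cached result

def logical_matches(logical_keys: dict) -> dict:
    """Step 2: before physical-plan optimization, collect every logical
    output port whose logical cache key has a cached result. The hits are
    passed to the physical-plan optimizer as additional input."""
    return {port: cache[k] for port, k in logical_keys.items() if k in cache}

def resolve_port(port: str, logical_hits: dict, physical_key: str):
    """Step 3: the optimizer considers both sources of reuse for a port."""
    if port in logical_hits:
        # A logical-level match provides the final output for free.
        return ("reuse-logical", logical_hits[port])
    if physical_key in cache:
        # A physical-level match reuses an intermediate of the chosen plan.
        return ("reuse-physical", cache[physical_key])
    return ("compute", None)

# Execution 1 cached the join output (logical key) and the build-side
# hash table (physical-only key).
cache["L:Join.o1"] = "storage://results/join"
cache["P:Build.o1"] = "storage://results/hash-table"

hits = logical_matches({"Join.o1": "L:Join.o1", "Scan.o1": "L:Scan.o1"})
assert resolve_port("Join.o1", hits, "P:Probe.o1") == ("reuse-logical", "storage://results/join")
assert resolve_port("Build.o1", {}, "P:Build.o1") == ("reuse-physical", "storage://results/hash-table")
assert resolve_port("Sort.o1", {}, "P:Sort.o1") == ("compute", None)
```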

![cache-key-logical-physical-mapping](https://github.com/user-attachments/assets/c2a49bf6-0179-4242-8e36-dd6c7fa9292f)


In the figure above, the logical `Join` operator has output port `o1`. In 
Physical Plan 1 (HashJoin), `Join.o1` maps to `Probe.o1`. In Physical Plan 2 
(SortMergeJoin), `Join.o1` maps to `Merge.o1`. A logical cache key for 
`Join.o1` matches in both physical plans via this port mapping. Physical cache 
keys, on the other hand, do not match across implementations: `Probe.o1 ≠ 
Merge.o1`. Meanwhile, `Build.o1` is a physical-only port with no logical 
counterpart, enabling finer-grained reuse within the HashJoin implementation.
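The port mapping in the figure can be written down directly. A small sketch, assuming each candidate physical plan records which physical output port realizes each logical output port (the table and function names are illustrative):

```python
# Per-plan mapping from logical output ports to physical output ports.
PORT_MAP = {
    "HashJoin":      {"Join.o1": "Probe.o1"},   # Build.o1 is physical-only
    "SortMergeJoin": {"Join.o1": "Merge.o1"},
}

def physical_port_for(plan: str, logical_port: str) -> str:
    return PORT_MAP[plan][logical_port]

# A result cached under the logical key for Join.o1 is applicable in
# either physical plan via this mapping:
assert physical_port_for("HashJoin", "Join.o1") == "Probe.o1"
assert physical_port_for("SortMergeJoin", "Join.o1") == "Merge.o1"

# Build.o1 appears in no mapping: it has no logical counterpart and is
# reachable only through physical-level cache keys.
assert all("Build.o1" not in m.values() for m in PORT_MAP.values())
```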

## Relation to the Current One-to-One Design

Under the current one-to-one mapping, the physical plan has strictly *more* 
cacheable locations than the logical plan. Physical operators that are internal 
to a logical operator (such as `Build` within `HashJoin`) have output ports 
with their own cache keys, but these ports have no logical-plan counterpart. So 
even under one-to-one mapping, physical-plan cache keys provide finer-grained 
reuse than logical-plan cache keys alone.

The proposed extension adds logical-level cache keys on top of the existing 
physical-level keys. This only becomes useful under one-to-many 
logical-to-physical mapping, where the cross-implementation reuse benefit of 
logical keys comes into play.

## Note on Reuse Policy

When a cached result matches (at either level), the system always reuses it 
rather than recomputing. This is a simplifying assumption; there are workloads 
where recomputation would be cheaper (e.g., when a cached result is large and 
expensive to read from storage). A cost-based reuse decision is an interesting 
direction but is outside the scope of this design.

GitHub link: https://github.com/apache/texera/discussions/4346
