JanKaul opened a new issue, #22036:
URL: https://github.com/apache/datafusion/issues/22036

   ### Describe the bug
   
   `FairSpillPool` divides spillable memory by the number of 
currently-registered
   spillable `MemoryConsumer`s. Combined with two facts about how DataFusion
   plans execute, this means the operator actually doing work can be capped far
   below its fair share:
   
   1. `SortExec::execute` (and other operators that build an `ExternalSorter`)
      register their `MemoryConsumer` **synchronously** during `execute()`,
      before the returned stream is ever polled
      
([`sort.rs:1248-1259`](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L1248-L1259)).
      Parent operators call children's `execute()` synchronously too. So **all**
      spillable consumers in a plan typically register before any data flows.
   2. `SortExec` is a blocking operator: the first poll on its output stream
      runs the entire input-consumption phase before yielding. So in a chain of
      blocking spillable operators, only the bottom-most operator is doing work
      at any given moment, even though every operator above it is already
      registered with the pool.
   
   The result is that the operator currently doing the work is artificially
   capped at `pool_size / num_spill`, where `num_spill` includes operators
   that have not yet started.
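
   The capping behaviour described above can be sketched as a simplified model
   (illustrative only, not DataFusion's actual `FairSpillPool` code; the type
   and method names here are hypothetical):

    ```rust
    /// Simplified model of a fair spill pool's per-consumer cap
    /// (hypothetical names; not the real DataFusion implementation).
    struct FairPoolModel {
        pool_size: usize,
        /// Number of registered spillable consumers.
        num_spill: usize,
    }

    impl FairPoolModel {
        /// Each registered spillable consumer is capped at an equal share
        /// of the pool, whether or not it is currently doing any work.
        fn fair_share(&self) -> usize {
            self.pool_size / self.num_spill.max(1)
        }
    }

    fn main() {
        // Five registered consumers: the one active operator gets only
        // 1/5 of the pool, even though the other four are still waiting.
        let pool = FairPoolModel { pool_size: 100 << 20, num_spill: 5 };
        println!("per-consumer cap = {} bytes", pool.fair_share());
    }
    ```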
   
   ## Example
   
   Consider a plan with five stacked `SortExec`s sharing a `FairSpillPool` sized
   so that a single sort fits comfortably at `cap == pool_size`, but not at
   `cap == pool_size / 5`:
   
   ```
   SortExec (top, level 4)
     └── SortExec (level 3)
           └── SortExec (level 2)
                 └── SortExec (level 1)
                       └── SortExec (bottom, level 0)
                             └── <source>
   ```
   
   When the root stream is polled:
   
   - All five `execute()` calls run recursively and synchronously, so all five
     `ExternalSorter` consumers register up-front. `num_spill == 5`.
   - The first poll drives the bottom sort's input phase. It is the only
     operator doing real work, but its per-reservation cap is `pool_size / 5`,
     so it spills repeatedly.
   - As the bottom sort finishes its input phase and unregisters, the next
     level up starts its input phase — now under `num_spill == 4`, with a
     larger fair share.
   - This continues up the chain. The top sort runs last under `num_spill == 1`
     and gets the entire pool, never spilling at all.
   
   A representative spill-count distribution from bottom to top looks like:
   
   ```
   [3, 3, 2, 2, 0]
   ```
   
   Even though every level processes the same data shape against the same pool,
   the operator that runs first is forced to spill the most, and the operator
   that runs last does not spill at all.
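
   A toy simulation of the walkthrough above (hypothetical code, not taken
   from DataFusion) makes the order-dependence visible: the cap seen by each
   level's input phase grows as the levels below it finish and unregister:

    ```rust
    fn main() {
        let pool_size: usize = 100; // arbitrary units
        let mut num_spill = 5; // all five sorts register up-front

        // Levels run bottom-up; each unregisters after its input phase,
        // so the next level sees one fewer registered consumer.
        for level in 0..5 {
            let cap = pool_size / num_spill;
            println!("level {level}: num_spill = {num_spill}, cap = {cap}");
            num_spill -= 1;
        }
    }
    ```

   The bottom level runs under a cap of `pool_size / 5`, while the top level
   runs under the full `pool_size`, which is what skews the spill counts.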
   
   ## Expected behavior
   
   For operators in a pipeline of blocking spillable operators processing
   identically-shaped inputs against the same pool, the spill behaviour should
   not depend on **execution order**. The per-reservation fair share should
   reflect operators that are *currently competing* for memory, not all
   operators that have been registered but are still waiting their turn.
   
   ## Possible directions for discussion
   
   (Not a proposal — just to start the conversation.)
   
   One reference point worth considering is Velox's hierarchical memory pool
   structure. Instead of a single flat pool with a counter of registered
   consumers, Velox organises memory pools as a tree: a root pool per query,
   with child pools per task / pipeline / operator. Each pool tracks its own
   reservation and propagates usage up to its parent, and memory arbitration
   is performed by walking the tree rather than by dividing a global capacity
   by a global consumer count.
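
   As a rough sketch of that shape (hypothetical types; not Velox's or
   DataFusion's actual API), a hierarchical pool tracks reservations per node
   and aggregates usage up the tree, so arbitration can see where memory is
   actually held rather than relying on a global consumer count:

    ```rust
    /// Toy hierarchical memory pool in the spirit of Velox's design
    /// (illustrative only; all names are hypothetical).
    struct PoolNode {
        name: String,
        /// Bytes reserved directly by this node's operator.
        reserved: usize,
        children: Vec<PoolNode>,
    }

    impl PoolNode {
        fn new(name: &str) -> Self {
            PoolNode { name: name.to_string(), reserved: 0, children: Vec::new() }
        }

        /// Usage of this node plus all descendants; a parent "sees" its
        /// children's reservations by walking the tree.
        fn total_reserved(&self) -> usize {
            self.reserved
                + self.children.iter().map(PoolNode::total_reserved).sum::<usize>()
        }
    }

    fn main() {
        let mut root = PoolNode::new("query");
        let mut sort0 = PoolNode::new("sort-level-0");
        sort0.reserved = 64;
        let sort1 = PoolNode::new("sort-level-1"); // registered but idle
        root.children.push(sort0);
        root.children.push(sort1);
        // Arbitration can observe that only sort-level-0 holds memory;
        // the idle sibling does not dilute anyone's share.
        println!("root usage = {}", root.total_reserved());
    }
    ```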
   
   I've created a [PR](https://github.com/apache/datafusion/pull/22035) that
   explores a solution using a SubPool for blocking operators. It shows how a
   MemoryPool hierarchy can support more complex scenarios than the current
   flat MemoryPools.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_



