alamb commented on issue #22090:
URL: https://github.com/apache/datafusion/issues/22090#issuecomment-4415084807
> RepartitionExec's distribution channels (distributor_channels.rs) only
> throttle producers when every output channel has at least one buffered item

This is by design, to avoid deadlocks.

> The producer should be throttled when total buffered memory crosses a
> configured threshold, regardless of how many channels are technically non-empty.

If one of the channels is empty, are you guaranteed that the other, non-empty
channels can make progress? For example, consider the classic diamond plan:
```
        SortPreservingMerge (or some other operator
        where consumption is a function
        of the values in the streams)
      ┌──────────────────────┐
      │        Merge         │
      └──────────────────────┘
          ▲       ▲      ▲
          │       │      │
          │       │      │
          │       │      │
          │       │      │
          │       │      │
          │       │      │
          │       │      │
          │       │      │
     ┌───┐│       │      │ ┌───┐
     │   ││       │      │ │   │
     │   ││       │      │ │   │    Channel 1 and 3 are full
     └───┘│       │      │ └───┘    / memory full
      ┌───┴───────┴──────┴───┐      but a batch is needed in
      │     Repartition      │      Channel 2 to make
      └──────────────────────┘      progress
```
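To make the scenario concrete, here is a minimal sketch of the two throttling rules applied to the state in the diagram. It is a toy model, not the code in distributor_channels.rs: all function names are hypothetical, batch counts stand in for buffered memory, and it assumes the producer still has input destined for the empty channel.

```rust
/// Current rule as described in the issue: throttle the producer only when
/// every output channel already holds at least one buffered batch.
/// `buffered[i]` is the number of batches waiting in channel `i`.
fn gate_allows_send(buffered: &[usize]) -> bool {
    buffered.iter().any(|&n| n == 0)
}

/// Proposed rule (hypothetical): throttle once the total number of buffered
/// batches reaches `limit`, regardless of which channels hold them.
fn budget_allows_send(buffered: &[usize], limit: usize) -> bool {
    buffered.iter().sum::<usize>() < limit
}

/// A merge-like consumer can only emit output once it has a batch from
/// every input, because the next row depends on the values in all streams.
fn merge_can_progress(buffered: &[usize]) -> bool {
    buffered.iter().all(|&n| n > 0)
}

/// The plan is stuck when the consumer is waiting for input and the
/// producer is not allowed to send anything.
fn is_deadlocked(buffered: &[usize], producer_allowed: bool) -> bool {
    !merge_can_progress(buffered) && !producer_allowed
}
```

With `buffered = [4, 0, 4]` and a budget of 8 batches, the gate rule still lets the producer send (channel 2 is empty), so the merge eventually gets the batch it needs. The budget rule blocks the producer while the merge is waiting on channel 2, and in this model nothing can move.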
If you have one consumer falling behind, I think a better strategy might be to
apply back pressure at the consumer end (rather than at the Repartition).

What are the consumers in this case?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]