adriangb commented on issue #22090:
URL: https://github.com/apache/datafusion/issues/22090#issuecomment-4425113379
Could an answer be to tweak how `RepartitionExec` itself spills? I.e.,
instead of accumulating data in memory, it would spill when there's a lot of
skew. I think it already spills under some conditions (memory pressure?), but
maybe it should spill more aggressively (e.g. keep only 1 batch in memory at a time).
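
To make the "keep only 1 batch in memory" idea concrete, here is a minimal sketch. It is not `RepartitionExec`'s actual code: `Batch`, `Slot`, and `PartitionBuffer` are hypothetical names, `Batch` stands in for an Arrow `RecordBatch`, and the spill file is simulated with a `Vec` so the example has no dependencies.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an Arrow `RecordBatch`.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch(pub Vec<i64>);

/// Where a queued batch currently lives.
#[derive(Debug, PartialEq)]
pub enum Slot {
    Memory(Batch),
    /// Index into the simulated spill store.
    Spilled(usize),
}

/// Hypothetical per-output-partition buffer with a hard cap on in-memory batches.
pub struct PartitionBuffer {
    max_in_memory: usize,
    in_memory: usize,
    queue: VecDeque<Slot>,
    /// Simulated spill file; a real implementation would write to disk.
    spill_store: Vec<Batch>,
}

impl PartitionBuffer {
    pub fn new(max_in_memory: usize) -> Self {
        Self {
            max_in_memory,
            in_memory: 0,
            queue: VecDeque::new(),
            spill_store: Vec::new(),
        }
    }

    /// Keep the batch in memory while under the cap, otherwise spill it.
    pub fn push(&mut self, batch: Batch) {
        if self.in_memory < self.max_in_memory {
            self.in_memory += 1;
            self.queue.push_back(Slot::Memory(batch));
        } else {
            self.spill_store.push(batch);
            self.queue.push_back(Slot::Spilled(self.spill_store.len() - 1));
        }
    }

    /// Return batches in arrival order, reading spilled ones back.
    pub fn pop(&mut self) -> Option<Batch> {
        match self.queue.pop_front()? {
            Slot::Memory(batch) => {
                self.in_memory -= 1;
                Some(batch)
            }
            Slot::Spilled(index) => Some(self.spill_store[index].clone()),
        }
    }

    pub fn in_memory(&self) -> usize {
        self.in_memory
    }

    pub fn spilled(&self) -> usize {
        self.queue
            .iter()
            .filter(|slot| matches!(slot, Slot::Spilled(_)))
            .count()
    }
}
```

With `PartitionBuffer::new(1)`, a skewed partition that receives many batches before its consumer polls holds one batch in memory and spills the rest, while consumers still see batches in arrival order.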
Related: if a partition is spilling because its upstream operator is slow,
and that operator is slow because it spilled, I wonder if it would be
beneficial to have an `enum InFlightData { Memory(RecordBatch), Disk { file,
start, end } }` or something like that. The point is: if we are going from one
spilling operator to another, maybe moving the data around on disk instead of
loading it only to spill it again would make sense. But that'd be a _big change_.
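
A rough sketch of what that enum could look like, and where it would save I/O. Everything beyond the `InFlightData` shape quoted above is an assumption for illustration: `RecordBatch` here is a dependency-free stand-in for the Arrow type, the field types (`PathBuf`, `u64` byte offsets) are guesses, and `Action` and `plan_handoff` are hypothetical helpers that do not exist in DataFusion.

```rust
use std::path::PathBuf;

/// Stand-in for `arrow::record_batch::RecordBatch` (hypothetical, no dependencies).
#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatch {
    pub num_rows: usize,
}

/// Data passed between operators: either materialized or a byte range in a spill file.
#[derive(Debug, Clone, PartialEq)]
pub enum InFlightData {
    Memory(RecordBatch),
    /// Assumed field types: a file path plus a half-open byte range `[start, end)`.
    Disk { file: PathBuf, start: u64, end: u64 },
}

impl InFlightData {
    /// Size of the on-disk range in bytes, or `None` for in-memory data.
    pub fn disk_len(&self) -> Option<u64> {
        match self {
            InFlightData::Memory(_) => None,
            InFlightData::Disk { start, end, .. } => Some(end.saturating_sub(*start)),
        }
    }
}

/// What the receiving operator has to do with an incoming item (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Already in memory and the consumer can hold it: no I/O.
    UseInMemory,
    /// In memory but the consumer must spill: one write.
    SpillToDisk,
    /// On disk and the consumer needs it in memory: one read.
    ReadFromDisk,
    /// On disk and the consumer would spill anyway: pass the reference, no I/O.
    ForwardReference,
}

/// Decide the handoff without touching the data itself.
pub fn plan_handoff(data: &InFlightData, consumer_spills: bool) -> Action {
    match (data, consumer_spills) {
        (InFlightData::Memory(_), false) => Action::UseInMemory,
        (InFlightData::Memory(_), true) => Action::SpillToDisk,
        (InFlightData::Disk { .. }, false) => Action::ReadFromDisk,
        (InFlightData::Disk { .. }, true) => Action::ForwardReference,
    }
}
```

The `ForwardReference` case is the one the comment is after: without a `Disk` variant, a spilling consumer fed by a spilling producer pays a read followed by a write for data that was already on disk.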
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.