adriangb commented on issue #22090:
URL: https://github.com/apache/datafusion/issues/22090#issuecomment-4425113379

   Could an answer be to tweak how `RepartitionExec` itself spills? I.e. 
instead of accumulating data in memory, it spills when there's a lot of skew. I 
think it already spills in some conditions (memory pressure?) but maybe it 
should spill more aggressively (e.g. only keep 1 batch in memory at a time).
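
   To make the "only keep 1 batch in memory" idea concrete, here is a rough 
sketch. It is not how `RepartitionExec` works today: `SkewBuffer`, `Slot` and 
the `Batch` alias are hypothetical names, `Batch` stands in for `RecordBatch`, 
and the "disk" is simulated with a `Vec` so the example stays self-contained.

```rust
use std::collections::VecDeque;

/// Stand-in for an Arrow `RecordBatch` (hypothetical, illustration only).
type Batch = Vec<i64>;

/// Where a queued batch currently lives.
enum Slot {
    Memory(Batch),
    /// Index into `SkewBuffer::spilled`, standing in for a spill file entry.
    Spilled(usize),
}

/// Hypothetical per-output-partition buffer that keeps at most
/// `max_in_memory` batches resident and spills everything beyond that.
struct SkewBuffer {
    max_in_memory: usize,
    in_memory: usize,
    queue: VecDeque<Slot>,
    /// Simulated disk; a real implementation would write spill files.
    spilled: Vec<Option<Batch>>,
}

impl SkewBuffer {
    fn new(max_in_memory: usize) -> Self {
        Self {
            max_in_memory,
            in_memory: 0,
            queue: VecDeque::new(),
            spilled: Vec::new(),
        }
    }

    /// Queue a batch, spilling it if the in-memory budget is already used.
    fn push(&mut self, batch: Batch) {
        if self.in_memory < self.max_in_memory {
            self.in_memory += 1;
            self.queue.push_back(Slot::Memory(batch));
        } else {
            self.spilled.push(Some(batch));
            self.queue.push_back(Slot::Spilled(self.spilled.len() - 1));
        }
    }

    /// Pop the oldest batch (FIFO), reading it back if it was spilled.
    fn pop(&mut self) -> Option<Batch> {
        match self.queue.pop_front()? {
            Slot::Memory(batch) => {
                self.in_memory -= 1;
                Some(batch)
            }
            Slot::Spilled(idx) => self.spilled[idx].take(),
        }
    }

    /// Number of batches currently held in memory.
    fn resident(&self) -> usize {
        self.in_memory
    }
}
```

   With `SkewBuffer::new(1)`, a slow consumer never holds more than one batch 
in memory for that partition, and batches still come out in arrival order.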
   
   Related: if a partition is spilling because its upstream operator is slow, 
and that operator is slow because it spilled itself, I wonder if it would be 
beneficial to have an `enum InFlightData { Memory(RecordBatch), Disk { file, 
start, end } }` or something like that. The point is: if we are going from one 
spilling operator to another, maybe pushing the data around on disk instead of 
loading it only to spill it again would make sense. But that'd be a _big change_.
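
   A minimal sketch of what such an enum could look like, under heavy 
assumptions: `Batch` is a stand-in for a serialized `RecordBatch`, `spill` and 
`load` are hypothetical helpers, and `start`/`end` are taken to be byte offsets 
into the spill file. Nothing here exists in DataFusion.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::{Path, PathBuf};

/// Stand-in for a serialized `RecordBatch` (hypothetical).
type Batch = Vec<u8>;

/// Data passed between operators: either resident or a range in a spill file.
#[derive(Debug, Clone, PartialEq)]
enum InFlightData {
    Memory(Batch),
    Disk { file: PathBuf, start: u64, end: u64 },
}

impl InFlightData {
    /// Spill to `file`. Data that is already on disk is passed through
    /// untouched, which is the point: no load followed by a second spill.
    fn spill(self, file: &Path) -> io::Result<InFlightData> {
        match self {
            disk @ InFlightData::Disk { .. } => Ok(disk),
            InFlightData::Memory(bytes) => {
                let mut f = OpenOptions::new().create(true).append(true).open(file)?;
                // Appending, so the new range starts at the current file length.
                let start = f.metadata()?.len();
                f.write_all(&bytes)?;
                Ok(InFlightData::Disk {
                    file: file.to_path_buf(),
                    start,
                    end: start + bytes.len() as u64,
                })
            }
        }
    }

    /// Materialize the data in memory, reading the byte range if needed.
    fn load(self) -> io::Result<Batch> {
        match self {
            InFlightData::Memory(bytes) => Ok(bytes),
            InFlightData::Disk { file, start, end } => {
                let mut f = File::open(file)?;
                f.seek(SeekFrom::Start(start))?;
                let mut buf = vec![0u8; (end - start) as usize];
                f.read_exact(&mut buf)?;
                Ok(buf)
            }
        }
    }
}
```

   A downstream spilling operator would call `spill` on whatever it receives 
and only pay for I/O on the `Memory` variant. File lifetime and cleanup are 
ignored here, and that is probably where most of the real complexity lives.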


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

