westonpace opened a new pull request, #40722:
URL: https://github.com/apache/arrow/pull/40722

   ### Rationale for this change
   
   The dataset writer fired the resume callback as soon as the underlying 
dataset writer's queues freed up, even if there were still pending tasks.  
Backpressure does not take effect immediately, so a few tasks always trickle 
in after a pause is requested.  If backpressure pauses and resumes frequently, 
these pending tasks can build up and cause uncontrolled memory growth.
   
   ### What changes are included in this PR?
   
   The resume callback is not called until all pending write tasks have 
completed.
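
   The gating rule above can be illustrated with a small toy model.  This is 
only a sketch of the idea, not the code in this PR: the class `GatedResume` 
and all of its method names are hypothetical, and the real dataset writer is 
C++ and is structured differently.  The model resumes only when the queue has 
freed up and the pending task count has dropped to zero.

```python
class GatedResume:
    """Toy model of resume gating (hypothetical, not Arrow's implementation)."""

    def __init__(self, resume_callback):
        self._resume_callback = resume_callback
        self._paused = False
        self._queue_full = False
        self._pending_tasks = 0

    def task_submitted(self):
        # A write task has been handed to the writer but has not finished yet.
        self._pending_tasks += 1

    def task_finished(self):
        self._pending_tasks -= 1
        self._maybe_resume()

    def queue_filled(self):
        # The underlying queue is full, so backpressure pauses the producer.
        self._queue_full = True
        self._paused = True

    def queue_freed(self):
        self._queue_full = False
        self._maybe_resume()

    def _maybe_resume(self):
        # Resume only if the queue has room AND nothing is still pending.
        if self._paused and not self._queue_full and self._pending_tasks == 0:
            self._paused = False
            self._resume_callback()


events = []
gate = GatedResume(lambda: events.append("resume"))
gate.task_submitted()
gate.task_submitted()
gate.queue_filled()
gate.queue_freed()     # two tasks still pending, so no resume yet
gate.task_finished()   # one task still pending, so no resume yet
gate.task_finished()   # last pending task done, resume fires once
```

   In this model, dropping the `self._pending_tasks == 0` condition 
reproduces the old behavior, where resume fires as soon as the queue frees up.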
   
   ### Are these changes tested?
   
   There is quite an extensive set of tests for the dataset writer already and 
they continue to pass.  I ran them on repeat, with and without stress, and did 
not see any issues.
   
   However, the underlying problem (the dataset writer can have uncontrolled 
memory growth) is still not covered by a test, as it is quite difficult to 
test.  I was able to reproduce the problem using the setup described in the 
issue.  With this fix, the repartitioning task completes for me.
   
   ### Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

