dekuu5 commented on issue #20027:
URL: https://github.com/apache/datafusion/issues/20027#issuecomment-3818140067

   Hello @2010YOUY01,
   
   I spent some time investigating this issue. Initially, I wasn't able to 
reproduce the bug even when running the tests 200 times in parallel.
   
   However, I wrote a custom stress-test script to run the test case with much 
higher concurrency (100 parallel instances), and I was finally able to 
reproduce the failure consistently.
   
   After debugging the reproduction, I identified a race condition in the 
coordination logic of the SpillPool. The poll_next function relies on a 
buffered stream (spawn_buffered) to read the file concurrently. The issue is 
that the background buffer task is not aware of the writer's status. Under 
heavy load, the buffer task can hit a temporary EOF (before the writer 
finishes) and quit prematurely. As a result, poll_next receives None from the 
stream and closes the reader before all batches are processed.
   
   I changed the stream to a normal unbuffered stream, and with that change 
the failure no longer reproduced. I will open a PR shortly.
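   
   To illustrate the race described above, here is a minimal, deterministic 
sketch. It is not the DataFusion code: SharedSpill, Poll, poll_naive, 
poll_writer_aware and run are hypothetical names, and the spill file is 
modelled as an in-memory queue. It only shows why a reader that treats 
"no data right now" as end of stream can close before the writer finishes, 
and how checking the writer's status avoids that. It is not the change 
proposed for the PR.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the spill file shared by writer and reader.
struct SharedSpill {
    batches: VecDeque<u32>,
    writer_finished: bool,
}

/// What the reader observes on a single poll.
#[derive(Debug, PartialEq)]
enum Poll {
    Batch(u32),
    Pending, // nothing to read yet, but the writer is still active
    Done,    // end of stream
}

/// Naive reader: an empty queue is treated as end of stream,
/// even if the writer has not finished (the "temporary EOF" case).
fn poll_naive(spill: &mut SharedSpill) -> Poll {
    match spill.batches.pop_front() {
        Some(b) => Poll::Batch(b),
        None => Poll::Done,
    }
}

/// Writer-aware reader: an empty queue only means end of stream
/// once the writer has signalled that it is finished.
fn poll_writer_aware(spill: &mut SharedSpill) -> Poll {
    match spill.batches.pop_front() {
        Some(b) => Poll::Batch(b),
        None if spill.writer_finished => Poll::Done,
        None => Poll::Pending,
    }
}

/// Simulates one interleaving in which the reader always outpaces the
/// writer: after each write, the reader polls until it runs out of data.
fn run(poll: fn(&mut SharedSpill) -> Poll, total: u32) -> Vec<u32> {
    let mut spill = SharedSpill {
        batches: VecDeque::new(),
        writer_finished: false,
    };
    let mut received = Vec::new();
    for i in 0..total {
        spill.batches.push_back(i); // writer appends one batch
        loop {
            match poll(&mut spill) {
                Poll::Batch(b) => received.push(b),
                Poll::Pending => break,        // wait for the writer
                Poll::Done => return received, // reader closed the stream
            }
        }
    }
    spill.writer_finished = true;
    // Drain whatever is left after the writer has finished.
    while let Poll::Batch(b) = poll(&mut spill) {
        received.push(b);
    }
    received
}
```

   With three batches written, run(poll_naive, 3) stops after the first 
batch, while run(poll_writer_aware, 3) returns all three.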


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

