edrevo commented on pull request #9523:
URL: https://github.com/apache/arrow/pull/9523#issuecomment-786237494


   Ok, after looking at the code for a while, I'm going to throw in a theory: 
the problem lies with 
https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/repartition.rs#L216
   
   That line blocks the thread until data arrives. That is not a good idea in 
Tokio: with enough partitions, it could end up blocking every thread in the 
threadpool. I believe this should be `.try_recv`, and we should map 
`Err(TryRecvError::Empty)` to `Poll::Pending`.
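   
   A minimal sketch of that idea, not the actual `repartition.rs` code. It 
assumes a `std::sync::mpsc` channel as a stand-in for whatever channel that 
line uses; `NextItem`, `ThreadWaker`, `block_on` and `demo` are hypothetical 
names made up for illustration. One caveat the sketch makes explicit: 
returning `Poll::Pending` without arranging a wake-up would leave the task 
parked forever, so this version re-wakes itself immediately, which amounts to 
busy polling. An async-aware channel would avoid that.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// Hypothetical future: resolves to the next item of a std channel
/// without ever blocking the thread that polls it.
struct NextItem<'a, T> {
    rx: &'a Receiver<T>,
}

impl<'a, T> Future for NextItem<'a, T> {
    type Output = Option<T>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match self.rx.try_recv() {
            Ok(item) => Poll::Ready(Some(item)),
            // All senders are gone and the buffer is empty: end of stream.
            Err(TryRecvError::Disconnected) => Poll::Ready(None),
            Err(TryRecvError::Empty) => {
                // A synchronous channel cannot wake us when data arrives,
                // so request another poll right away (busy polling).
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
}

/// Tiny single-threaded executor (an assumption for the demo, not Tokio).
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks this thread.
            Poll::Pending => thread::park(),
        }
    }
}

/// Drains a channel fed by another thread using the non-blocking future.
fn demo() -> Vec<i32> {
    let (tx, rx) = channel();
    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).unwrap();
        }
        // `tx` is dropped here, which disconnects the channel.
    });
    let mut out = Vec::new();
    while let Some(value) = block_on(NextItem { rx: &rx }) {
        out.push(value);
    }
    producer.join().unwrap();
    out
}
```

   Calling `demo()` collects the three values sent by the producer thread. The 
point of the sketch is only that `poll` returns instead of parking the worker 
thread inside a blocking `recv`.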
   
   What is my working theory? The downstream ExecutionPlan will request 
progress from as many partitions as it can. If these requests happen fast 
enough, they might occupy every thread in the Tokio threadpool, so the task in 
charge of feeding data to these channels cannot progress: deadlock.
   
   I don't have time to make the change and test it out, but if someone wants 
to pick this up and see if it solves the issue, please go ahead. If not, I'll 
try it out tomorrow.
   
   Cheers!



