edrevo commented on pull request #9523: URL: https://github.com/apache/arrow/pull/9523#issuecomment-786237494
Ok, after looking at the code for a while, I'm going to throw in a theory: the problem lies with https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/repartition.rs#L216

That line blocks the thread until data comes in. This is not a good idea in Tokio: if there are enough partitions, it could end up blocking all of the threads in the threadpool. I believe this should be `.try_recv`, and we should map `Err(TryRecvError::Empty)` to `Poll::Pending`.

My working theory: the downstream ExecutionPlan requests progress from as many partitions as it can. If these requests happen fast enough, they might block every thread in the tokio threadpool, so the task in charge of feeding data to these channels cannot progress: deadlock.

I don't have time to make the change and test it out, but if someone wants to pick this up and see if it solves the issue, please go ahead. If not, I'll try it out tomorrow.

Cheers!
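To make the proposed mapping concrete, here is a minimal sketch of a non-blocking poll. It is not the DataFusion code: `poll_channel` and `NoopWake` are hypothetical names, and `std::sync::mpsc` stands in for whatever channel type `repartition.rs` actually uses. One caveat the sketch makes explicit: returning `Poll::Pending` without arranging a wake-up would leave the task parked forever, so the example wakes itself immediately (in effect a busy poll). A real fix would more likely use an async-aware channel that registers the waker.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical helper: poll a channel without blocking the executor thread.
fn poll_channel<T>(rx: &Receiver<T>, cx: &mut Context<'_>) -> Poll<Option<T>> {
    match rx.try_recv() {
        // An item was already queued: hand it to the caller.
        Ok(item) => Poll::Ready(Some(item)),
        // All senders are gone: the stream is finished.
        Err(TryRecvError::Disconnected) => Poll::Ready(None),
        // Nothing yet: yield instead of blocking the thread.
        Err(TryRecvError::Empty) => {
            // Assumption for this sketch: request an immediate re-poll,
            // since a plain std channel cannot notify the waker itself.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing, only needed to build a `Context` by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn main() {
    let (tx, rx) = channel::<i32>();
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // Empty channel: the poll returns right away instead of blocking.
    assert!(poll_channel(&rx, &mut cx).is_pending());

    tx.send(7).unwrap();
    assert_eq!(poll_channel(&rx, &mut cx), Poll::Ready(Some(7)));

    // Dropping the last sender ends the stream.
    drop(tx);
    assert_eq!(poll_channel(&rx, &mut cx), Poll::Ready(None));
}
```

The point of the sketch is only that the polling thread is handed back to the runtime when the channel is empty, so the task feeding the channel gets a chance to run.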