alamb commented on PR #19360: URL: https://github.com/apache/datafusion/pull/19360#issuecomment-3666437472
> Edit: I was mistaken, the cross join is actually returning batches sufficiently frequently. It's actually the aggregation loop that's not yielding. I'm reminded of the original discussion we had on this topic in #16196. Whose responsibility is it to yield sufficiently frequently? Back then I argued that it's the looping code (i.e. the aggregation loop) that's responsible for this because that's where the root cause of the problem lies. My memory was that the general consensus was that the operators themselves (ie. in this case the aggregation loop) should be yielding periodically to the scheduler, and that the cooperative() method was a workaround that gave us some more time to figure out how to do this (and tokio's budget wasn't yet public) > I've experimented a bit with a variant of what's suggested in this PR in https://github.com/pepijnve/datafusion/tree/coop_joins. I really like this method from https://github.com/pepijnve/datafusion/commit/aeecad3c4e88d68f3593c065f6b2266b20fb8809 ```rust /// Consumes a unit of budget. If the task budget has been depleted `Poll::Pending` is returned. /// Otherwise, `Poll::Ready(()))` is returned. /// /// The task will only yield if its entire coop budget has been exhausted. /// This function can be used in order to insert optional yield points into long /// computations that do not use Tokio resources, /// without redundantly yielding to the runtime each time. pub fn consume_budget(cx: &mut Context<'_>) -> Poll<()> { tokio::task::coop::poll_proceed(cx).map(|r| r.made_progress()) } ``` @pepijnve would your proposal be to add the appropriate `consume_budget` calls in group by hash instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
