neilconway commented on PR #21240: URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4165653133
I spent a little while thinking about how to overlap evaluation of the main query with the subqueries; at least so far, I haven't been able to find a clean solution.

The core problem is that expression evaluation is synchronous. That means the naive design doesn't work: ideally we'd just start evaluating both the main query and all the subqueries, and have `SubqueryEvalExpr` in the main query block until the corresponding subquery result is available. That doesn't play nicely with synchronous expression evaluation; it would probably also be prone to deadlock.

Physical plan evaluation is async, so that is a natural place to safely wait for subquery evaluation. So one idea is to add a wrapper physical plan node to every place that evaluates an expression containing a reference to an uncorrelated subquery. Let's call it `WaitForSubqueryExec`. It does two things concurrently:

1. Wait for a given subquery result to be available
2. Start asynchronously evaluating the wrapped child node and buffering its content, up to some given buffer size (seems like we could use `BufferExec` for this or roll our own thing)

Once the subquery result is available, we can then just proceed with normal expression evaluation / pulling on the plan's input. So the physical plan goes from

```
FilterExec(predicate_with_subquery_expr, child_node)
```

to

```
WaitForSubqueryExec(FilterExec(predicate_with_subquery_expr, BufferExec(child_node)))
```

That should work, and actually seems like not a horrible design, but cluttering the physical plan with two extra operators seems unfortunate.

It's a bit surprising to me that we need to work this hard to get parity with the old cross-join approach, which isn't doing anything special to overlap computation. I guess it just falls out more naturally from the old approach / existing join machinery.

Curious what you think or if you can suggest a better approach @Dandandan (cc @alamb)
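To make the two concurrent steps concrete, here is a rough, std-only sketch of the idea. It is not DataFusion code: it uses OS threads and channels as a stand-in for async streams, and every name in it (`eval_subquery`, `wait_for_subquery_filter`, the `row > threshold` predicate) is hypothetical. The bounded `sync_channel` plays the role that `BufferExec` might play, and the blocking `recv` stands in for the point where `WaitForSubqueryExec` would await the subquery result.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for an uncorrelated scalar subquery such as
/// `SELECT avg(x) FROM t`. The sleep simulates a slow subquery.
fn eval_subquery(values: Vec<i64>) -> i64 {
    thread::sleep(Duration::from_millis(20));
    // Integer average; `max(1)` avoids dividing by zero on empty input.
    values.iter().sum::<i64>() / (values.len().max(1) as i64)
}

/// Hypothetical analogue of
/// `WaitForSubqueryExec(FilterExec(row > subquery, BufferExec(child)))`.
fn wait_for_subquery_filter(
    child_rows: Vec<i64>,
    subquery_input: Vec<i64>,
    buffer_size: usize,
) -> Vec<i64> {
    // Step 1: start evaluating the subquery in the background.
    let (result_tx, result_rx) = channel();
    let subquery = thread::spawn(move || {
        let _ = result_tx.send(eval_subquery(subquery_input));
    });

    // Step 2: start producing the child's rows concurrently. The bounded
    // channel holds at most `buffer_size` rows; once it is full, the
    // producer blocks until the consumer starts pulling.
    let (row_tx, row_rx) = sync_channel(buffer_size);
    let child = thread::spawn(move || {
        for row in child_rows {
            if row_tx.send(row).is_err() {
                break; // consumer went away
            }
        }
        // `row_tx` is dropped here, which ends the consumer's iteration.
    });

    // Wait for the subquery result; the child keeps buffering meanwhile.
    let threshold = result_rx.recv().expect("subquery thread failed");

    // The predicate can now be evaluated synchronously, because the
    // subquery value is already known.
    let out: Vec<i64> = row_rx.iter().filter(|row| *row > threshold).collect();

    subquery.join().unwrap();
    child.join().unwrap();
    out
}
```

For example, `wait_for_subquery_filter((1..=10).collect(), vec![2, 4, 6], 4)` computes a subquery value of 4 and returns the child rows greater than it, `[5, 6, 7, 8, 9, 10]`. In a real async implementation the waiting and buffering would presumably happen inside the operator's stream rather than on dedicated threads.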
