neilconway commented on PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4165653133

   I spent a little while thinking about how to overlap evaluation of the main 
query with the subqueries; at least so far, I haven't been able to find a clean 
solution.
   
   The core problem is that expression evaluation is synchronous. That means 
the naive design doesn't work: ideally we'd just start evaluating both the main 
query and all the subqueries, and have `SubqueryEvalExpr` in the main query 
block until the corresponding subquery result is available. That doesn't play 
nicely with synchronous expression evaluation, and it would probably also be 
prone to deadlock.
   
   Physical plan evaluation is async, so that is a natural place to safely wait 
for subquery evaluation. So one idea is to add a wrapper physical plan node to 
every place that evaluates an expression containing a reference to an 
uncorrelated subquery. Let's call it `WaitForSubqueryExec`. That does two 
things concurrently:
   
   1. Wait for a given subquery result to be available
   2. Start asynchronously evaluating the wrapped child node and buffering its 
content, up to some given buffer size (seems like we could use `BufferExec` for 
this or roll our own thing)
   
   Once the subquery result is available, we can then just proceed with normal 
expression evaluation / pulling on the plan's input.
   
   So the physical plan goes from
   ```
   FilterExec(predicate_with_subquery_expr, child_node)
   ```
   to
   ```
   WaitForSubqueryExec(FilterExec(predicate_with_subquery_expr, BufferExec(child_node)))
   ```
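
   A rough sketch of the control flow described above, using plain threads and 
std channels as a stand-in for async streams. Nothing here is DataFusion API: 
`wait_for_subquery_then_filter`, the hard-coded subquery value `10`, and the 
`v > threshold` predicate are all hypothetical, chosen only to show the 
"wait for the subquery while buffering the child" shape.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

/// Hypothetical stand-in for the proposed wrapper node: it waits for the
/// subquery result while the child is concurrently buffered into a bounded
/// channel, then applies an ordinary synchronous predicate.
fn wait_for_subquery_then_filter(child_rows: Vec<i64>, buffer_size: usize) -> Vec<i64> {
    // (1) Evaluate the uncorrelated scalar subquery on its own thread.
    let (sub_tx, sub_rx) = channel::<i64>();
    let subquery = thread::spawn(move || {
        // Assumption: the subquery is something like `SELECT 10`.
        sub_tx.send(10).expect("subquery receiver should be alive");
    });

    // (2) Start pulling the child right away, buffering up to `buffer_size`
    // rows. `send` blocks once the buffer is full, which gives backpressure.
    let (row_tx, row_rx) = sync_channel::<i64>(buffer_size);
    let producer = thread::spawn(move || {
        for row in child_rows {
            if row_tx.send(row).is_err() {
                break; // consumer went away; stop producing
            }
        }
    });

    // Wait for the subquery result. This is the only place that waits, and it
    // sits outside expression evaluation.
    let threshold = sub_rx.recv().expect("subquery should produce a value");

    // From here on the predicate is evaluated synchronously, as usual,
    // draining the buffered rows first and then the rest of the child.
    let out: Vec<i64> = row_rx.iter().filter(|v| *v > threshold).collect();

    subquery.join().expect("subquery thread panicked");
    producer.join().expect("producer thread panicked");
    out
}
```

   Calling `wait_for_subquery_then_filter(vec![5, 12, 7, 30], 2)` returns 
`[12, 30]`. The bounded channel plays the role of the buffer in front of the 
child: the producer makes progress while the subquery is still running, but 
only up to `buffer_size` rows.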
   
   That should work, and actually doesn't seem like a horrible design, but 
cluttering the physical plan with two extra operators seems unfortunate.
   
   It's a bit surprising to me that we need to work this hard to get parity 
with the old cross-join approach, which isn't doing anything special to overlap 
computation. I guess it just falls out more naturally from the old approach / 
existing join machinery.
   
   Curious what you think or if you can suggest a better approach @Dandandan 
(cc @alamb)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

