neilconway opened a new pull request, #21240: URL: https://github.com/apache/datafusion/pull/21240
## Which issue does this PR close? - Closes #3781. - Closes #18181. ## Rationale for this change Previously, DataFusion evaluated uncorrelated scalar subqueries by transforming them into joins. This has two shortcomings: 1. Scalar subqueries that return > 1 row were allowed, producing incorrect query results. Such queries should instead result in a runtime error. 2. Performance. Evaluating scalar subqueries as a join requires going through the join machinery. More importantly, it means that UDFs that have special-cases for scalar inputs cannot use those code paths for scalar subqueries, which often results in significantly slower query execution. This PR introduces physical execution of uncorrelated scalar subqueries: * Uncorrelated subqueries are left in the plan by the logical planner, not rewritten into joins * The physical planner collects uncorrelated scalar subqueries, and plans them recursively (supporting nested subqueries). We add a `ScalarSubqueryExec` plan node to the top of any physical plan with uncorrelated subqueries: it has n+1 children, n subqueries and its "main" input, which is the rest of the query plan. The subquery expression is replaced with a `ScalarSubqueryExpr`. * `ScalarSubqueryExec` manages the execution of the subqueries and stashes the result in a shared "results container", which is an `Arc<Vec<OnceLock<ScalarValue>>>`. At present, subquery evaluation is done sequentially and not overlapped with evaluation of the parent query. * When `ScalarSubqueryExpr` is evaluated, it fetches the result of the subquery from the result container. This architecture makes it easy to avoid the two shortcomings described above. Performance seems roughly unchanged (benchmarks added in this PR), but in situations like #18181, we can now leverage scalar fast-paths; in the case of #18181, this improves performance from ~800 ms to ~30 ms. ## What changes are included in this PR? * Benchmarks * Modify subquery rewriter to not transform subqueries -> joins * Collect and plan uncorrelated scalar subqueries in the physical planner, and wire up `ScalarSubqueryExpr` * Support for proto serialization/deserialization using `PhysicalProtoConverterExtension` to wire up `ScalarSubqueryExpr` correctly * Add various SLT tests and update expected plan shapes for some tests ## Are these changes tested? Yes. ## Are there any user-facing changes? Not really. Scalar subqueries that returned > 1 row will now be rejected instead of producing incorrect query results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
