avantgardnerio commented on code in PR #2885:
URL: https://github.com/apache/arrow-datafusion/pull/2885#discussion_r926054946
##########
datafusion/optimizer/src/utils.rs:
##########
@@ -82,6 +119,199 @@ pub fn add_filter(plan: LogicalPlan, predicates: &[&Expr])
-> LogicalPlan {
})
}
+/// Looks for correlating expressions: equality expressions with one field
from the subquery, and
+/// one not in the subquery (closed upon from outer scope)
+///
+/// # Arguments
+///
+/// * `exprs` - List of expressions that may or may not be joins
+/// * `fields` - HashSet of fully qualified (table.col) fields in subquery
schema
+///
+/// # Return value
+///
+/// Tuple of (expressions containing joins, remaining non-join expressions)
+pub fn find_join_exprs(
+ exprs: Vec<&Expr>,
+ schema: &DFSchemaRef,
+) -> Result<(Vec<Expr>, Vec<Expr>)> {
+ let fields: HashSet<_> = schema
+ .fields()
+ .iter()
+ .map(|it| it.qualified_name())
+ .collect();
+
+ let mut joins = vec![];
+ let mut others = vec![];
+ for filter in exprs.iter() {
+ let (left, op, right) = match filter {
+ Expr::BinaryExpr { left, op, right } => (*left.clone(), *op,
*right.clone()),
+ _ => {
+ others.push((*filter).clone());
+ continue;
+ }
+ };
+ let left = match left {
+ Expr::Column(c) => c,
+ _ => {
+ others.push((*filter).clone());
+ continue;
+ }
+ };
+ let right = match right {
+ Expr::Column(c) => c,
+ _ => {
+ others.push((*filter).clone());
+ continue;
+ }
+ };
+ if fields.contains(&left.flat_name()) &&
fields.contains(&right.flat_name()) {
+ others.push((*filter).clone());
+ continue; // both columns present (none closed-upon)
+ }
+ if !fields.contains(&left.flat_name()) &&
!fields.contains(&right.flat_name()) {
+ others.push((*filter).clone());
+ continue; // neither column present (syntax error?)
+ }
+ match op {
+ Operator::Eq => {}
+ Operator::NotEq => {}
+ _ => {
+ plan_err!(format!("can't optimize {} column comparison", op))?;
Review Comment:
I left this in because there were tests for the `Eq` and `NotEq` cases, and
since errors just skip the optimization, I thought better to fail until we have
clearer test cases... You're probably right? To make it harder, this code is
shared across all three new optimization rules. It's probably fine for `SEMI`
and `ANTI` join, but I do worry about accidentally expanding row count or worse
with the scalar `inner` and `outer` joins.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]