james727 commented on a change in pull request #1618:
URL: https://github.com/apache/arrow-datafusion/pull/1618#discussion_r789951842



##########
File path: datafusion/src/optimizer/filter_push_down.rs
##########
@@ -1139,6 +1136,172 @@ mod tests {
         Ok(())
     }
 
+    /// post-join predicates on the right side of a left join are duplicated 
and pushed down
+    /// to the left side.
+    #[test]
+    fn filter_using_left_join() -> Result<()> {
+        let table_scan = test_table_scan()?;
+        let left = LogicalPlanBuilder::from(table_scan).build()?;
+        let right_table_scan = test_table_scan_with_name("test2")?;
+        let right = LogicalPlanBuilder::from(right_table_scan)
+            .project(vec![col("a")])?
+            .build()?;
+        let plan = LogicalPlanBuilder::from(left)
+            .join_using(
+                &right,
+                JoinType::Left,
+                vec![Column::from_name("a".to_string())],
+            )?
+            .filter(col("test2.a").lt_eq(lit(1i64)))?
+            .build()?;
+
+        // not part of the test, just good to know:
+        assert_eq!(
+            format!("{:?}", plan),
+            "\
+            Filter: #test2.a <= Int64(1)\
+            \n  Join: Using #test.a = #test2.a\
+            \n    TableScan: test projection=None\
+            \n    Projection: #test2.a\
+            \n      TableScan: test2 projection=None"
+        );
+
+        // filter duplicated and pushed down to the left
+        let expected = "\
+        Filter: #test2.a <= Int64(1)\
+        \n  Join: Using #test.a = #test2.a\
+        \n    Filter: #test.a <= Int64(1)\

Review comment:
       Oh man, great catch! I think this gets back to some previous discussion 
around filters that preserve nulls vs filters that do not preserve nulls. It 
seems this filter duplication logic is valid for filters that do not preserve 
nulls, but breaks when the filter preserves nulls (as shown above).
   
   E.g. - correct me if I'm wrong I  _think_ that duplicating the filter is 
correct in the following query:
   ```
   SELECT *
   FROM t1 LEFT JOIN t2
   ON t1.x = t2.y
   WHERE t2.y > 5 -- Can duplicate filter for t1.x > 5
   ```
   
   What do you think about going back to the logic in [this 
commit](https://github.com/apache/arrow-datafusion/pull/1618/commits/c80772b5ce2c1a6419cefb247d55ec9ef2ef9dbd)
 that just skips this optimization for non-inner joins (before I got fancy with 
it). If a filter does not preserve nulls, we will be able to rewrite the join 
as an `INNER` join anyways, so once that optimization is implemented we'll get 
the correct duplication logic with another pass of this optimizer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to