Re: [PR] Fix Schema Duplication Errors in Self‑Referential INTERSECT/EXCEPT by Requalifying Input Sides [datafusion]

via GitHub Wed, 19 Nov 2025 01:43:40 -0800


martin-g commented on code in PR #18814:
URL: https://github.com/apache/datafusion/pull/18814#discussion_r2541222147



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1729,23 +1738,53 @@ pub fn requalify_sides_if_needed(
 ) -> Result<(LogicalPlanBuilder, LogicalPlanBuilder, bool)> {
     let left_cols = left.schema().columns();
     let right_cols = right.schema().columns();
-    if left_cols.iter().any(|l| {
-        right_cols.iter().any(|r| {
-            l == r || (l.name == r.name && (l.relation.is_none() || 
r.relation.is_none()))
-        })
-    }) {
-        // These names have no connection to the original plan, but they'll 
make the columns
-        // (mostly) unique.
-        Ok((
-            left.alias(TableReference::bare("left"))?,
-            right.alias(TableReference::bare("right"))?,
-            true,
-        ))
-    } else {
-        Ok((left, right, false))
+
+    // Requalify if merging the schemas would cause an error during join.
+    // This can happen in several cases:
+    // 1. Duplicate qualified fields: both sides have same relation.name
+    // 2. Duplicate unqualified fields: both sides have same unqualified name
+    // 3. Ambiguous reference: one side qualified, other unqualified, same name
+    for l in &left_cols {
+        for r in &right_cols {

Review Comment:
   Here the complexity is O(n*m).
   You could optimize it to O(n+m) by iterating over left_cols (O(n)) and 
storing them in a HashMap<ColumnName, Column>, then while iterating over 
right_cols (O(m)) lookup by name in the hashmap (O(1)) and do the checks when 
there is an entry for that name.



##########
datafusion/substrait/tests/cases/roundtrip_logical_plan.rs:
##########
@@ -1099,6 +1099,92 @@ async fn simple_intersect_table_reuse() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn self_referential_intersect() -> Result<()> {
+    // Test INTERSECT with the same table on both sides
+    // This previously failed with "Schema contains duplicate qualified field 
name"
+    // The fix ensures requalify_sides_if_needed is called in 
intersect_or_except
+    // After roundtrip through Substrait, SubqueryAlias is lost and 
requalification
+    // produces "left" and "right" aliases
+    // Note: INTERSECT (without ALL) includes DISTINCT, but the outer Aggregate
+    // is optimized away, resulting in just the LeftSemi join
+    assert_expected_plan(
+        "SELECT a FROM data WHERE a > 0 INTERSECT SELECT a FROM data WHERE a < 
5",
+        "LeftSemi Join: left.a = right.a\
+        \n  SubqueryAlias: left\
+        \n    Aggregate: groupBy=[[data.a]], aggr=[[]]\
+        \n      Filter: data.a > Int64(0)\
+        \n        TableScan: data projection=[a], partial_filters=[data.a > 
Int64(0)]\
+        \n  SubqueryAlias: right\
+        \n    Filter: data.a < Int64(5)\
+        \n      TableScan: data projection=[a], partial_filters=[data.a < 
Int64(5)]",
+        true,
+    )
+    .await
+}
+
+#[tokio::test]
+async fn self_referential_except() -> Result<()> {
+    // Test EXCEPT with the same table on both sides
+    // This previously failed with "Schema contains duplicate qualified field 
name"
+    // The fix ensures requalify_sides_if_needed is called in 
intersect_or_except
+    // After roundtrip through Substrait, SubqueryAlias is lost and 
requalification
+    // produces "left" and "right" aliases
+    // Note: EXCEPT (without ALL) includes DISTINCT, but the outer Aggregate
+    // is optimized away, resulting in just the LeftAnti join
+    assert_expected_plan(
+        "SELECT a FROM data WHERE a > 0 EXCEPT SELECT a FROM data WHERE a < 5",
+        "LeftAnti Join: left.a = right.a\

Review Comment:
   Is there a difference between the plans for INTERSECT 
(self_referential_intersect) and EXCEPT (self_referential_except) ?
   I don't see any.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix Schema Duplication Errors in Self‑Referential INTERSECT/EXCEPT by Requalifying Input Sides [datafusion]

Reply via email to