kosiew commented on code in PR #18814:
URL: https://github.com/apache/datafusion/pull/18814#discussion_r2542076181


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1729,23 +1738,53 @@ pub fn requalify_sides_if_needed(
 ) -> Result<(LogicalPlanBuilder, LogicalPlanBuilder, bool)> {
     let left_cols = left.schema().columns();
     let right_cols = right.schema().columns();
-    if left_cols.iter().any(|l| {
-        right_cols.iter().any(|r| {
-            l == r || (l.name == r.name && (l.relation.is_none() || r.relation.is_none()))
-        })
-    }) {
-        // These names have no connection to the original plan, but they'll make the columns
-        // (mostly) unique.
-        Ok((
-            left.alias(TableReference::bare("left"))?,
-            right.alias(TableReference::bare("right"))?,
-            true,
-        ))
-    } else {
-        Ok((left, right, false))
+
+    // Requalify if merging the schemas would cause an error during join.
+    // This can happen in several cases:
+    // 1. Duplicate qualified fields: both sides have same relation.name
+    // 2. Duplicate unqualified fields: both sides have same unqualified name
+    // 3. Ambiguous reference: one side qualified, other unqualified, same name
+    for l in &left_cols {
+        for r in &right_cols {

Review Comment:
   Excellent observation on the algorithmic complexity. You're correct that the 
current nested loop is O(n*m), and this can be optimized to O(n+m) using a 
HashMap.
   
   **Analysis:**
   
   Here are some reasons for the current implementation:
   
   2. **Schema size is typically small:** For example, the TPC-H benchmark 
schemas range from 3-16 columns (median ~8). Even the largest table (lineitem, 
with 16 columns) would result in at most 256 comparisons worst-case for this 
function, which is negligible for modern CPUs.
   
   2. **Early return on conflict:** The function returns immediately upon 
finding the first conflict, so in the common case where conflicts exist (which 
is when this function matters), we often exit very early in the iteration.
   
   3. **Simple conflict detection logic:** The current implementation is 
straightforward and easy to reason about. The match statement clearly shows all 
conflict scenarios.
   
   4. **Called infrequently:** This function is only called during logical plan 
construction, not during execution. It's not in a hot path that runs millions 
of times.
   
   **Trade-offs of HashMap approach:**
   
   **Pros:**
   - O(n+m) vs O(n*m) complexity
   - Scales better for schemas with hundreds of columns
   
   **Cons:**
   - More memory allocation overhead for the HashMap
   - More complex code that's slightly harder to understand
   - HashMap construction and hashing overhead may not pay off for small schemas
   - Need to handle the case where multiple columns share the same name within a 
single schema (which can happen with different qualifiers)
   
   If you feel strongly about this or if we anticipate very wide schemas 
(hundreds of columns), I'm happy to implement the HashMap-based optimization.
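   
   For reference, here is a rough standalone sketch of what the HashMap-based 
variant could look like. Note the types are simplified stand-ins (`Column` below 
is a hypothetical struct, not DataFusion's actual `Column`): it buckets the left 
columns by name, so each right column only probes its same-name bucket, and 
same-name buckets naturally handle multiple columns with one name but different 
qualifiers:
   
   ```rust
   use std::collections::HashMap;
   
   // Hypothetical simplified stand-in for DataFusion's Column type.
   #[derive(Clone, PartialEq, Debug)]
   struct Column {
       relation: Option<String>,
       name: String,
   }
   
   // O(n + m) variant: bucket the left columns by name, then probe with each
   // right column. Within a same-name bucket, the original predicate
   // `l == r || (l.name == r.name && (l.relation.is_none() || r.relation.is_none()))`
   // reduces to `l == r || l.relation.is_none() || r.relation.is_none()`,
   // since the names are already known to be equal.
   fn has_conflict(left: &[Column], right: &[Column]) -> bool {
       let mut by_name: HashMap<&str, Vec<&Column>> = HashMap::new();
       for l in left {
           by_name.entry(l.name.as_str()).or_default().push(l);
       }
       right.iter().any(|r| {
           by_name.get(r.name.as_str()).map_or(false, |bucket| {
               bucket.iter().any(|l| {
                   **l == *r || l.relation.is_none() || r.relation.is_none()
               })
           })
       })
   }
   ```
   
   This trades two extra allocations (the map and its buckets) for linear-time 
probing, which only pays off once schemas get wide.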



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

