Whatsonyourmind commented on issue #21310:
URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4199268781

   @xiedeyantu Good follow-up on the LIMIT case. You are right that the 
filter-distinctness check already guards the scenario I raised — if the two 
UNION branches have different filters (`a=1` vs `b=2`), they cannot be 
collapsed because the output rows are not necessarily the same.
   
   The case you are asking about — same underlying table, same columns, both 
with LIMIT but no differing filter — is the interesting one. The rewrite is 
unsafe in general because:
   
   ```
   (SELECT mgr, comm FROM emp LIMIT 2) UNION (SELECT mgr, comm FROM emp LIMIT 2)
   ```
   
   The UNION semantics here are: take up to 2 rows from each side, then dedupe 
the combined result. The row count can be anywhere from 2 (both sides return 
the same 2 rows) to 4 (the two LIMITs pick disjoint rows). Rewriting this to 
`SELECT DISTINCT mgr, comm FROM emp LIMIT 4` changes the semantics — you are 
now asking for up to 4 distinct rows from the full table, not "the union of two 
non-deterministic 2-row samples."
   
   The rewrite is safe only when:
   
   1. Both LIMITs are deterministic (i.e., there is an ORDER BY that fully 
determines the row order), AND
   2. The LIMIT values are known at plan time, AND
   3. You can prove the two LIMIT clauses yield the same row set
   
   In practice the simplest rule is: **skip the rewrite entirely if either 
branch contains LIMIT**. This matches what PostgreSQL does in its UNION-to-OR 
rewriter and avoids the subtle non-determinism trap. The complexity of 
detecting the safe case is high and the payoff is low (the unsafe case is much 
more common than the safe case).
   
   The same rule applies to OFFSET, FOR UPDATE, and any other non-deterministic 
or state-modifying clause on the UNION branches. Adding a small guard function 
`is_limit_like_clause(plan)` that covers all of these in one predicate keeps 
the main rewrite logic clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to