xiedeyantu commented on issue #21310:
URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4199532658

   > [@xiedeyantu](https://github.com/xiedeyantu) Good follow-up on the LIMIT 
case. You are right that the filter-distinctness check already guards the 
scenario I raised — if the two UNION branches have different filters (`a=1` vs 
`b=2`), they cannot be collapsed because the output rows are not necessarily 
the same.
   > 
   > The case you are asking about — same underlying table, same columns, both 
with LIMIT but no differing filter — is the interesting one. The rewrite is 
unsafe in general because:
   > 
   > ```
   > (SELECT mgr, comm FROM emp LIMIT 2) UNION (SELECT mgr, comm FROM emp LIMIT 2)
   > ```
   > 
   > The UNION semantics here are: take up to 2 rows from each side, then 
dedupe the combined result. The row count can be anywhere from 2 (both sides 
return the same 2 rows) to 4 (the two LIMITs pick disjoint rows). Rewriting 
this to `SELECT DISTINCT mgr, comm FROM emp LIMIT 4` changes the semantics — 
you are now asking for up to 4 distinct rows from the full table, not "the 
union of two non-deterministic 2-row samples."
   > 
   > The rewrite is safe only when:
   > 
   > 1. Both LIMITs are deterministic (i.e., there is an ORDER BY that fully 
determines the row order), AND
   > 2. The LIMIT values are known at plan time, AND
   > 3. You can prove the two LIMIT clauses yield the same row set
   > 
   > In practice the simplest rule is: **skip the rewrite entirely if either 
branch contains LIMIT**. This matches what PostgreSQL does in its UNION-to-OR 
rewriter and avoids the subtle non-determinism trap. The complexity of 
detecting the safe case is high and the payoff is low (the unsafe case is much 
more common than the safe case).
   > 
   > The same rule applies to OFFSET, FOR UPDATE, and any other 
non-deterministic or state-modifying clause on the UNION branches. Adding a 
small guard function `is_limit_like_clause(plan)` that covers all of these in 
one predicate keeps the main rewrite logic clean.
   
   Thank you for the careful review and the keen observations. The PR already 
skips the rewrite whenever a `UNION` branch contains a `SORT` or `LIMIT`.
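
For readers following along, the kind of guard discussed here can be sketched roughly as below. This is only an illustration in Python over a toy plan type. The names `Plan`, `has_sort_or_limit`, and `can_rewrite_union` are hypothetical and are not DataFusion's actual types or the code in the PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Toy logical-plan node (hypothetical, not DataFusion's LogicalPlan)."""
    kind: str  # e.g. "Scan", "Filter", "Projection", "Sort", "Limit"
    children: List["Plan"] = field(default_factory=list)


# Node kinds whose presence makes a branch's output depend on row order
# or on which rows happen to be picked.
ORDER_SENSITIVE = {"Sort", "Limit"}


def has_sort_or_limit(plan: Plan) -> bool:
    """Return True if the subtree rooted at `plan` contains a Sort or Limit."""
    if plan.kind in ORDER_SENSITIVE:
        return True
    return any(has_sort_or_limit(child) for child in plan.children)


def can_rewrite_union(branches: List[Plan]) -> bool:
    """Permit the UNION rewrite only if no branch contains Sort or Limit."""
    return not any(has_sort_or_limit(branch) for branch in branches)


# Two example branches over the same scan: one plain, one with a LIMIT on top.
scan = Plan("Scan")
plain = Plan("Projection", [Plan("Filter", [scan])])
limited = Plan("Limit", [Plan("Projection", [scan])])
```

With these toy plans, `can_rewrite_union([plain, plain])` allows the rewrite, while `can_rewrite_union([plain, limited])` rejects it because one branch contains a `Limit` node. The check walks the whole subtree, so a `Sort` or `Limit` buried under a projection is also caught.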


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

