xiedeyantu commented on issue #21310: URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4199532658
> [@xiedeyantu](https://github.com/xiedeyantu) Good follow-up on the LIMIT case. You are right that the filter-distinctness check already guards the scenario I raised — if the two UNION branches have different filters (`a=1` vs `b=2`), they cannot be collapsed because the output rows are not necessarily the same. > > The case you are asking about — same underlying table, same columns, both with LIMIT but no differing filter — is the interesting one. The rewrite is unsafe in general because: > > ``` > (SELECT mgr, comm FROM emp LIMIT 2) UNION (SELECT mgr, comm FROM emp LIMIT 2) > ``` > > The UNION semantics here are: take up to 2 rows from each side, then dedupe the combined result. The row count can be anywhere from 2 (both sides return the same 2 rows) to 4 (the two LIMITs pick disjoint rows). Rewriting this to `SELECT DISTINCT mgr, comm FROM emp LIMIT 4` changes the semantics — you are now asking for up to 4 distinct rows from the full table, not "the union of two non-deterministic 2-row samples." > > The rewrite is safe only when: > > 1. Both LIMITs are deterministic (i.e., there is an ORDER BY that fully determines the row order), AND > 2. The LIMIT values are known at plan time, AND > 3. You can prove the two LIMIT clauses yield the same row set > > In practice the simplest rule is: **skip the rewrite entirely if either branch contains LIMIT**. This matches what PostgreSQL does in its UNION-to-OR rewriter and avoids the subtle non-determinism trap. The complexity of detecting the safe case is high and the payoff is low (the unsafe case is much more common than the safe case). > > The same rule applies to OFFSET, FOR UPDATE, and any other non-deterministic or state-modifying clause on the UNION branches. Adding a small guard function `is_limit_like_clause(plan)` that covers all of these in one predicate keeps the main rewrite logic clean. Thank you for your interest and keen observations. I have already rejected scenarios involving `SORT` and `LIMIT` within `UNION` clauses in the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
