2010YOUY01 commented on issue #21973: URL: https://github.com/apache/datafusion/issues/21973#issuecomment-4414934122
> - No `pushdown_sorts` top-down pass I have a question about how to eliminate `pushdown_sorts`. For example: ``` case 1: sort(k1 asc) -- filter (k1 > 0) ---- scan(existing file order: k1 asc) case 2: sort(k1 desc) -- filter (k1 > 0) ---- scan(existing file order: k1 asc) // file format supports reverse scan ``` For case 1, a bottom-up traversal can eliminate the `SortExec`, because the existing input order of `SortExec` already satisfies the requirement. For case 2, however, it seems that some top-down negotiation is still needed: `SortExec` has to first tell the scanner to perform a reverse scan, and then the bottom-up traversal can remove the redundant sort. I was thinking that maybe we can let the `EnsureRequirements` rule only do a bottom-up scan and insert all required operators for ordering and repartitioning requirements. Other decisions, such as exploring reversed file scan order in the example above, can be kept in other optimizer rules. This would make `EnsureRequirements` easier to maintain. ---- Another question: the current `EnforceSorting` rule seems to handle a lot of special cases: - https://github.com/apache/datafusion/blob/e62f06c914cfdfb388abf0b49a45afe015d76fcb/datafusion/physical-optimizer/src/enforce_sorting/mod.rs#L198 I'm not sure how all those special cases are caused. Perhaps a unified bottom-up approach with `EnsureRequirements` can eliminate some of them. We might also want to separate some existing edge cases out to keep the `EnsureRequirements` rule simpler and more focused. ---- Here is one observation that might simplify the implementation a bit: Currently, in physical planning, `SortExec` is ambiguous. In the initial physical plan, the logical plan's sort node is directly converted to `SortExec`. In this case, it is more like a placeholder that is actually a "logical" sort operator, and it relies on later optimizer passes to add the required `SortPreservingMergeExec` to make the whole plan valid. Meanwhile, for some operators that have input requirements, specifically `SortMergeJoin` and `BoundedWindowAggExec`, their initial physical plans don't include `SortExec`. Instead, they directly use `ExecutionPlan::required_input_ordering` to let later optimizer passes insert the required sort operators. So we might choose either approach and make them consistent. This could make the later implementation require fewer special cases. For example, we could convert the logical plan's sort operator to `OutputRequirementExec` in the initial physical plan, so sort requirements have a consistent representation. Alternatively, we could insert `SortExec` below `SortMergeJoin` / `BoundedWindowAggExec` if that approach is easier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
