2010YOUY01 commented on issue #21973:
URL: https://github.com/apache/datafusion/issues/21973#issuecomment-4414934122

   > - No `pushdown_sorts` top-down pass
   
   I have a question about how to eliminate `pushdown_sorts`. For example:
   
   ```
   case 1:
   sort(k1 asc)
   -- filter (k1 > 0)
   ---- scan(existing file order: k1 asc)
   
   case 2:
   sort(k1 desc)
   -- filter (k1 > 0)
   ---- scan(existing file order: k1 asc)  // file format supports reverse scan
   ```
   
   For case 1, a bottom-up traversal can eliminate the `SortExec`, because the 
existing input order of `SortExec` already satisfies the requirement. For case 
2, however, it seems that some top-down negotiation is still needed: `SortExec` 
has to first tell the scanner to perform a reverse scan, and then the bottom-up 
traversal can remove the redundant sort.
   
   I was thinking that maybe we can let the `EnsureRequirements` rule only do a 
bottom-up scan and insert all required operators for ordering and 
repartitioning requirements. Other decisions, such as exploring reversed file 
scan order in the example above, can be kept in other optimizer rules. This 
would make `EnsureRequirements` easier to maintain.
   
   ----
   
   Another question: the current `EnforceSorting` rule seems to handle a lot of 
special cases:
   
   - 
https://github.com/apache/datafusion/blob/e62f06c914cfdfb388abf0b49a45afe015d76fcb/datafusion/physical-optimizer/src/enforce_sorting/mod.rs#L198
   
   I'm not sure how all those special cases are caused. Perhaps a unified 
bottom-up approach with `EnsureRequirements` can eliminate some of them. We 
might also want to separate some existing edge cases out to keep the 
`EnsureRequirements` rule simpler and more focused.
   
   ----
   
   Here is one observation that might simplify the implementation a bit:
   
   Currently, in physical planning, `SortExec` is ambiguous. In the initial 
physical plan, the logical plan's sort node is directly converted to 
`SortExec`. In this case, it is more like a placeholder that is actually a 
"logical" sort operator, and it relies on later optimizer passes to add the 
required `SortPreservingMergeExec` to make the whole plan valid.
   
   Meanwhile, for some operators that have input requirements, specifically 
`SortMergeJoin` and `BoundedWindowAggExec`, their initial physical plans don't 
include `SortExec`. Instead, they directly use 
`ExecutionPlan::required_input_ordering` to let later optimizer passes insert 
the required sort operators.
   
   So we might choose either approach and make them consistent. This could make 
the later implementation require fewer special cases. For example, we could 
convert the logical plan's sort operator to `OutputRequirementExec` in the 
initial physical plan, so sort requirements have a consistent representation. 
Alternatively, we could insert `SortExec` below `SortMergeJoin` / 
`BoundedWindowAggExec` if that approach is easier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to