alamb commented on a change in pull request #1776: URL: https://github.com/apache/arrow-datafusion/pull/1776#discussion_r801911051
########## File path: datafusion/src/physical_plan/mod.rs ########## @@ -142,29 +142,64 @@ pub trait ExecutionPlan: Debug + Send + Sync { /// Specifies the output partitioning scheme of this plan fn output_partitioning(&self) -> Partitioning; + /// If the output of this operator is sorted, returns `Some(keys)` + /// with the description of how it was sorted. + /// + /// For example, Sort, (obviously) produces sorted output as does + /// SortPreservingMergeStream. Less obviously `Projection` + /// produces sorted output if its input was sorted as it does not + /// reorder the input rows + fn output_ordering(&self) -> Option<&[PhysicalSortExpr]>; + /// Specifies the data distribution requirements of all the children for this operator fn required_child_distribution(&self) -> Distribution { Distribution::UnspecifiedDistribution } - /// Returns `true` if the direct children of this `ExecutionPlan` should be repartitioned - /// to introduce greater concurrency to the plan + /// Returns `true` if this operator relies on its inputs being + /// produced in a certain order (for example that they are sorted a particular way) for correctness. /// - /// The default implementation returns `true` unless `Self::required_child_distribution` - /// returns `Distribution::SinglePartition` + /// If `true` is returned, DataFusion will not apply certain + /// optimizations which might reorder the inputs (such as + /// repartitioning to increase concurrency). /// - /// Operators that do not benefit from additional partitioning may want to return `false` - fn should_repartition_children(&self) -> bool { - !matches!( - self.required_child_distribution(), - Distribution::SinglePartition - ) + /// The default implementation returns `false` + fn relies_on_input_order(&self) -> bool { + false + } + + /// Returns `false` if this operator's implementation may reorder + /// rows within or between partitions. + /// + /// For example, Projection, Filter, and Limit maintain the order + /// of inputs -- they may transform values (Projection) or not + /// produce the same number of rows that went in (Filter and + /// Limit), but the rows that are produced go in the same way. + /// + /// DataFusion uses this metadata to apply certain optimizations + /// such as automatically repartitioning correctly. + /// + /// The default implementation returns `false` + fn maintains_input_order(&self) -> bool { + false + } + + /// Returns `true` if this operator would benefit from + /// partitioning its input (and thus from more parallelism). For + /// operators that do very little work the overhead of extra + /// parallelism may outweigh any benefits + /// + /// The default implementation returns `true` + fn benefits_from_input_partitioning(&self) -> bool { Review comment: I think it defaults to true, actually. Perhaps you mean why do `sort` `limit` and `union` override the default `maintains_input_order`? If so the reason is that I know how they are implemented. The code on master is making the same assumption, FWIW, but after this PR the assumption is explicit ```rust fn maintains_input_order(&self) -> bool { // tell optimizer this operator doesn't reorder its input true } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org