[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1776: Update `ExecutionPlan` to know about sortedness and repartitioning optimizer pass respect the invariants

GitBox Tue, 08 Feb 2022 09:55:52 -0800


alamb commented on a change in pull request #1776:
URL: https://github.com/apache/arrow-datafusion/pull/1776#discussion_r801911051




##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -142,29 +142,64 @@ pub trait ExecutionPlan: Debug + Send + Sync {
     /// Specifies the output partitioning scheme of this plan
     fn output_partitioning(&self) -> Partitioning;
 
+    /// If the output of this operator is sorted, returns `Some(keys)`
+    /// with the description of how it was sorted.
+    ///
+    /// For example, Sort, (obviously) produces sorted output as does
+    /// SortPreservingMergeStream. Less obviously `Projection`
+    /// produces sorted output if its input was sorted as it does not
+    /// reorder the input rows
+    fn output_ordering(&self) -> Option<&[PhysicalSortExpr]>;
+
     /// Specifies the data distribution requirements of all the children for 
this operator
     fn required_child_distribution(&self) -> Distribution {
         Distribution::UnspecifiedDistribution
     }
 
-    /// Returns `true` if the direct children of this `ExecutionPlan` should 
be repartitioned
-    /// to introduce greater concurrency to the plan
+    /// Returns `true` if this operator relies on its inputs being
+    /// produced in a certain order (for example that they are sorted a 
particular way) for correctness.
     ///
-    /// The default implementation returns `true` unless 
`Self::required_child_distribution`
-    /// returns `Distribution::SinglePartition`
+    /// If `true` is returned, DataFusion will not apply certain
+    /// optimizations which might reorder the inputs (such as
+    /// repartitioning to increase concurrency).
     ///
-    /// Operators that do not benefit from additional partitioning may want to 
return `false`
-    fn should_repartition_children(&self) -> bool {
-        !matches!(
-            self.required_child_distribution(),
-            Distribution::SinglePartition
-        )
+    /// The default implementation returns `false`
+    fn relies_on_input_order(&self) -> bool {
+        false
+    }
+
+    /// Returns `false` if this operator's implementation may reorder
+    /// rows within or between partitions.
+    ///
+    /// For example, Projection, Filter, and Limit maintain the order
+    /// of inputs -- they may transform values (Projection) or not
+    /// produce the same number of rows that went in (Filter and
+    /// Limit), but the rows that are produced go in the same way.
+    ///
+    /// DataFusion uses this metadata to apply certain optimizations
+    /// such as automatically repartitioning correctly.
+    ///
+    /// The default implementation returns `false`
+    fn maintains_input_order(&self) -> bool {
+        false
+    }
+
+    /// Returns `true` if this operator would benefit from
+    /// partitioning its input (and thus from more parallelism). For
+    /// operators that do very little work the overhead of extra
+    /// parallelism may outweigh any benefits
+    ///
+    /// The default implementation returns `true`
+    fn benefits_from_input_partitioning(&self) -> bool {

Review comment:
       I think it defaults to true, actually.
   
   Perhaps you mean why do `sort` `limit` and `union` override the default 
`maintains_input_order`? If so the reason is that I know how they are 
implemented. The code on master is making the same assumption, FWIW, but after 
this PR the assumption is explicit
   
   ```rust
       fn maintains_input_order(&self) -> bool {
           // tell optimizer this operator doesn't reorder its input
           true
       }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1776: Update `ExecutionPlan` to know about sortedness and repartitioning optimizer pass respect the invariants

Reply via email to