alamb commented on issue #8043:
URL:
https://github.com/apache/arrow-datafusion/issues/8043#issuecomment-1806069322
Ok, I spent a while debugging our problem today
What is happening is that a `RepartitionExec` was accidentally keeping the
`preserve_ordering` flag even when the input wasn't sorted, which caused the
wrong variant to be used
This diff fixed my problem:
```diff
diff --git a/datafusion/physical-plan/src/repartition/mod.rs
b/datafusion/physical-plan/src/repartition/mod.rs
index 66f7037e5c..17babcd109 100644
--- a/datafusion/physical-plan/src/repartition/mod.rs
+++ b/datafusion/physical-plan/src/repartition/mod.rs
@@ -644,16 +644,16 @@ impl RepartitionExec {
})
}
- /// Set Order preserving flag
+
+ /// Set Order preserving flag, which controlls if this node is
+ /// `RepartitionExec` or `SortPreservingRepartitionExec`. If the input
is
+ /// not ordered, or has only one partiton, remains a `RepartitionExec`.
pub fn with_preserve_order(mut self, preserve_order: bool) -> Self {
- // Set "preserve order" mode only if the input partition count is
larger than 1
- // Because in these cases naive `RepartitionExec` cannot maintain
ordering. Using
- // `SortPreservingRepartitionExec` is necessity. However, when
input partition number
- // is 1, `RepartitionExec` can maintain ordering. In this case, we
don't need to use
- // `SortPreservingRepartitionExec` variant to maintain ordering.
- if self.input.output_partitioning().partition_count() > 1 {
- self.preserve_order = preserve_order
- }
+ self.preserve_order = preserve_order &&
+ // If the input isn't ordered, there is no ordering to
preserve
+ self.input.output_ordering().is_some() &&
+ // if there is only one input partition, merging is
required to maintain order
+ self.input.output_partitioning().partition_count() > 1;
self
}
```
I will prepare a PR with this fix shortly
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]