gene-bordegaray opened a new issue, #18594:
URL: https://github.com/apache/datafusion/issues/18594
### Is your feature request related to a problem or challenge?
Hash repartitions track whether input (sort) order is maintained, but
`EXPLAIN` output for a query does not always display this property.
### Describe the solution you'd like
We should display `maintains_order`, taken from the `maintains_input_order()`
function. This function returns true if `preserve_order=true` or
`input_partitions <= 1`, so the plan will always show when a repartition
maintains input order, whether explicitly or implicitly.
### Describe alternatives you've considered
Only displaying the `maintains_input_order()` output and eliminating the
`preserve_order` display. This would lose visibility into the implicit and
explicit decisions that are being made.
### Additional context
Order within repartitions depends on two things: the `preserve_order`
flag and `input_partition_count`.
1. If the preserve order flag is true, then order is preserved no matter the
number of input partitions.
2. If the preserve order flag is false, then order is only preserved if there
is a single input partition.
3. Otherwise, ordering is not preserved.
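The three rules above reduce to a single boolean expression. The following is a minimal sketch of that rule as a standalone function; the free function and its parameter names are hypothetical stand-ins for illustration, not DataFusion's actual implementation.

```rust
/// Hypothetical standalone helper mirroring the rules listed above.
/// This is an illustration only, not the code in `RepartitionExec`.
fn maintains_input_order(preserve_order: bool, input_partitions: usize) -> bool {
    // Rule 1: an explicit `preserve_order` flag always keeps the order.
    // Rule 2: with at most one input partition, order is kept implicitly.
    // Rule 3: every other combination does not keep the order.
    preserve_order || input_partitions <= 1
}
```

Under this sketch, a call with `preserve_order = false` and a single input partition still reports that order is maintained, which is the implicit case the issue wants surfaced.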
This property can be seen in use in joins.slt:3267, which is also where it
was highlighted:
```
query TT
EXPLAIN SELECT *
FROM (SELECT *, ROW_NUMBER() OVER() as rn1
FROM annotated_data ) as l_table
JOIN (SELECT *, ROW_NUMBER() OVER() as rn1
FROM annotated_data ) as r_table
ON l_table.a = r_table.a
ORDER BY l_table.a ASC NULLS FIRST, l_table.b, l_table.c, r_table.rn1
----
physical_plan
01)SortPreservingMergeExec: [a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST, rn1@11 ASC NULLS LAST]
02)--SortMergeJoin: join_type=Inner, on=[(a@1, a@1)]
03)----CoalesceBatchesExec: target_batch_size=2
04)------RepartitionExec: partitioning=Hash([a@1], 2), input_partitions=1
05)--------ProjectionExec: expr=[a0@0 as a0, a@1 as a, b@2 as b, c@3 as c, d@4 as d, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@5 as rn1]
06)----------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING": UInt64 }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted]
07)------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_ordering=[a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST], file_type=csv, has_header=true
08)----CoalesceBatchesExec: target_batch_size=2
09)------RepartitionExec: partitioning=Hash([a@1], 2), input_partitions=1
10)--------ProjectionExec: expr=[a0@0 as a0, a@1 as a, b@2 as b, c@3 as c, d@4 as d, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@5 as rn1]
11)----------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING": UInt64 }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted]
12)------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_ordering=[a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST], file_type=csv, has_header=true
```
This test takes advantage of the fact that hash repartitioning preserves
ordering by not inserting a `SortExec` node between the `SortMergeJoin` and
the hash `RepartitionExec`, but the plan does not clearly display that the
hash repartition has the order-preserving property.
The display logic can be found in
`datafusion/physical-plan/src/repartition/mod.rs`:611-650. The
`preserve_order` field is only displayed when the flag is true. The output
does not show that order is preserved when `input_partitions=1`, which is
misleading.
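To make the proposal concrete, here is a rough sketch of display logic that keeps the existing `preserve_order` output and adds a `maintains_order` field. The `RepartitionSummary` struct, its fields, and the exact `maintains_order=true` text are assumptions made for illustration; they are not taken from `repartition/mod.rs`, and the real display code may be structured differently.

```rust
use std::fmt;

/// Hypothetical stand-in for the values the EXPLAIN line reports.
/// The real `RepartitionExec` holds more state than this.
struct RepartitionSummary {
    partitioning: String,
    input_partitions: usize,
    preserve_order: bool,
}

impl RepartitionSummary {
    /// Same rule as described in the issue: explicit flag or a single input partition.
    fn maintains_input_order(&self) -> bool {
        self.preserve_order || self.input_partitions <= 1
    }
}

impl fmt::Display for RepartitionSummary {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "RepartitionExec: partitioning={}, input_partitions={}",
            self.partitioning, self.input_partitions
        )?;
        // Keep the explicit flag visible, as the current output does.
        if self.preserve_order {
            write!(f, ", preserve_order=true")?;
        }
        // Assumed new field: shown whenever order is kept, explicitly or implicitly.
        if self.maintains_input_order() {
            write!(f, ", maintains_order=true")?;
        }
        Ok(())
    }
}
```

With this sketch, the repartition from the plan above (`input_partitions=1`, no `preserve_order` flag) would render with `maintains_order=true` appended, while a multi-partition input without the flag would render unchanged.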
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]