gene-bordegaray opened a new issue, #18594:
URL: https://github.com/apache/datafusion/issues/18594

   ### Is your feature request related to a problem or challenge?
   
   Hash repartitions track whether the input sort order is maintained, but 
`EXPLAIN` does not always display this property for a query.
   
   ### Describe the solution you'd like
   
   We should display `maintains_order` from the `maintains_input_order()` 
function. This function returns true if `preserve_order=true` or 
`input_partitions <= 1`, so the plan will always show when a repartition 
maintains input order, whether explicitly or implicitly.
   
   ### Describe alternatives you've considered
   
   Displaying only the `maintains_input_order()` function output and removing 
the `preserve_order` display => This would lose visibility into the implicit 
and explicit decisions that are being made.
   
   ### Additional context
   
   Order within repartitions depends on two things: the `preserve_order` 
flag and `input_partition_count`.
   1. If the preserve order flag is true then no matter the number of input 
partitions the order will be preserved
   2. If the preserve order flag is false then order is only preserved if there 
is a single input partition
   3. Otherwise ordering is not preserved
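
   The rules above can be sketched as a single predicate. This is only an 
illustration of the logic described in this issue, not the actual DataFusion 
code: the free function `maintains_order` and its parameter names are 
hypothetical.

   ```rust
   /// Hypothetical helper mirroring the rule described in this issue.
   /// Order is kept when it is preserved explicitly (`preserve_order`)
   /// or implicitly (at most one input partition).
   fn maintains_order(preserve_order: bool, input_partitions: usize) -> bool {
       preserve_order || input_partitions <= 1
   }
   ```

   With this sketch, `maintains_order(false, 1)` is true (implicit), 
`maintains_order(true, 8)` is true (explicit), and `maintains_order(false, 2)` 
is false.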
   
   This property is relied on (and is where the gap was noticed) in 
joins.slt:3267:
   
   ```
   query TT
   EXPLAIN SELECT *
     FROM (SELECT *, ROW_NUMBER() OVER() as rn1
          FROM annotated_data ) as l_table
     JOIN (SELECT *, ROW_NUMBER() OVER() as rn1
          FROM annotated_data ) as r_table
     ON l_table.a = r_table.a
     ORDER BY l_table.a ASC NULLS FIRST, l_table.b, l_table.c, r_table.rn1
   ----
   physical_plan
   01)SortPreservingMergeExec: [a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST, rn1@11 ASC NULLS LAST]
   02)--SortMergeJoin: join_type=Inner, on=[(a@1, a@1)]
   03)----CoalesceBatchesExec: target_batch_size=2
   04)------RepartitionExec: partitioning=Hash([a@1], 2), input_partitions=1
   05)--------ProjectionExec: expr=[a0@0 as a0, a@1 as a, b@2 as b, c@3 as c, d@4 as d, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@5 as rn1]
   06)----------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING": UInt64 }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted]
   07)------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_ordering=[a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST], file_type=csv, has_header=true
   08)----CoalesceBatchesExec: target_batch_size=2
   09)------RepartitionExec: partitioning=Hash([a@1], 2), input_partitions=1
   10)--------ProjectionExec: expr=[a0@0 as a0, a@1 as a, b@2 as b, c@3 as c, d@4 as d, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@5 as rn1]
   11)----------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING": UInt64 }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted]
   12)------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_ordering=[a@1 ASC, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST], file_type=csv, has_header=true
   ```
   
   This test takes advantage of the fact that hash repartitioning preserves 
ordering here: no `SortExec` node is inserted between the `SortMergeJoin` and 
the hash `RepartitionExec`. However, the plan does not clearly show that the 
hash repartition has the order-preserving property.
   
   The display logic can be found in 
`datafusion/physical-plan/src/repartition/mod.rs`:611-650. The 
`preserve_order` field is only displayed when the flag is true. The output 
does not show that order is also preserved when `input_partitions=1`, which is 
misleading.
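
   As a rough illustration of the proposed display, the sketch below formats 
a plan line that keeps the existing `preserve_order` output and adds 
`maintains_order`. It is not the code in `mod.rs`: the `RepartitionSummary` 
struct, its fields, and the exact output format are hypothetical and only 
follow the behaviour described in this issue.

   ```rust
   use std::fmt;

   /// Hypothetical stand-in for the fields the display logic needs.
   struct RepartitionSummary {
       partitioning: String,
       input_partitions: usize,
       preserve_order: bool,
   }

   impl RepartitionSummary {
       /// Explicit (`preserve_order`) or implicit (single input partition).
       fn maintains_order(&self) -> bool {
           self.preserve_order || self.input_partitions <= 1
       }
   }

   impl fmt::Display for RepartitionSummary {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           write!(
               f,
               "RepartitionExec: partitioning={}, input_partitions={}",
               self.partitioning, self.input_partitions
           )?;
           // Shown only when the flag is set, as described above.
           if self.preserve_order {
               write!(f, ", preserve_order=true")?;
           }
           // Proposed addition: also covers the implicit single-partition case.
           if self.maintains_order() {
               write!(f, ", maintains_order=true")?;
           }
           Ok(())
       }
   }
   ```

   Under these assumptions, the repartition in the plan above 
(`input_partitions=1`, flag unset) would end with `maintains_order=true` and 
would not print `preserve_order=true`, which keeps the explicit and implicit 
cases distinguishable.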



