Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

via GitHub Thu, 24 Jul 2025 13:02:42 -0700


GitHub user alamb added a comment to the discussion: Best practices for 
memory-efficient deduplication of pre-sorted Parquet files


Yes, please. I actually did some testing today:
- https://github.com/apache/datafusion/issues/16899
- https://github.com/apache/datafusion/pull/16900

What I would expect in this case is an `AggregateExec` in the plan annotated 
with `ordering_mode=PartiallySorted([0])` (note that this is different from 
the `mode=Partial` annotation on the same line)



```sql
AggregateExec: mode=Partial, gby=[a@0 as a, b@1 as b], aggr=[count(Int64(1))], ordering_mode=PartiallySorted([0])
```

Perhaps you can double-check the explain plan with `EXPLAIN FORMAT INDENT ..`, 
which prints a more detailed version of the plan, including the per-operator 
annotations shown above.
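
If it helps when comparing plans, here is a minimal Python sketch for pulling the two annotations out of saved `EXPLAIN` text, so `mode=` and `ordering_mode=` are not confused. It is not part of DataFusion: the helper name `aggregate_ordering_modes` is hypothetical, and it assumes the plan text looks like the `AggregateExec` line above.

```python
import re
from typing import List, Optional, Tuple

# Matches the ordering_mode annotation, e.g. "ordering_mode=PartiallySorted([0])".
# Assumption: the annotation is a word optionally followed by "([...])".
ORDERING_MODE = re.compile(r"ordering_mode=(\w+(?:\(\[[\d,\s]*\]\))?)")
# \b keeps this from matching the "mode=" inside "ordering_mode=".
AGG_MODE = re.compile(r"\bmode=(\w+)")


def aggregate_ordering_modes(plan_text: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (mode, ordering_mode) for each AggregateExec found in plan_text.

    ordering_mode is None when the operator carries no such annotation.
    """
    # Email or terminal wrapping can split one operator across lines,
    # so collapse all whitespace before splitting on the operator name.
    flat = " ".join(plan_text.split())
    results = []
    for chunk in flat.split("AggregateExec:")[1:]:
        mode = AGG_MODE.search(chunk)
        ordering = ORDERING_MODE.search(chunk)
        results.append(
            (mode.group(1) if mode else None,
             ordering.group(1) if ordering else None)
        )
    return results


if __name__ == "__main__":
    plan = (
        "AggregateExec: mode=Partial, gby=[a@0 as a, b@1 as b], "
        "aggr=[count(Int64(1))], \nordering_mode=PartiallySorted([0])"
    )
    for mode, ordering in aggregate_ordering_modes(plan):
        print(f"mode={mode} ordering_mode={ordering}")
```

An `AggregateExec` that reports `ordering_mode` as `None` here has only the `mode=` annotation, which is the case to look out for.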

Thanks for sticking with this


GitHub link: 
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13881971

