Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

via GitHub Mon, 21 Jul 2025 12:34:43 -0700


GitHub user zheniasigayev added a comment to the discussion: Best practices for 
memory-efficient deduplication of pre-sorted Parquet files


> If you could make a reproducer with synthetic data and file a ticket I would 
> be happy to look into this further

I created a public Gist which you can find here: 
https://gist.github.com/zheniasigayev/2e5e471c9070cfa685d938bced47aa7f. 

I confirmed that the [two queries](https://github.com/apache/datafusion/discussions/16776#discussioncomment-13780110)
that I provided in the discussion above produce the same query plan and
memory consumers when run against the generated Parquet files.

GitHub link: 
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13837673

