GitHub user ndchandar created a discussion: Feedback on high memory usage when merging N parquet files
Hello, I am writing a program that takes N parquet files (where N = 40). Each source parquet file is ~6 to ~8 MB in size and Zstd compressed. The files are compacted/combined to produce one larger parquet file (~220 to ~250 MB). It appears that we need as much as **~24 GB** of memory for the compaction to succeed.

This gist https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138 is the interesting bit. It basically lists all files in a directory, takes N files, and compacts them. I tried giving the optimizer hints that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M. With less memory (e.g. 12 or 16 GB), I run into the issue below:

```
Caused by: Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool
```

I am trying to understand why the spill is not happening efficiently (I am relatively new to DataFusion). Looking for any help/hints to reduce the memory utilization.

GitHub link: https://github.com/apache/datafusion/discussions/18833
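For context (the gist itself is not reproduced here), below is a minimal sketch of the kind of compaction described above, written against a recent DataFusion release. The 4 GB memory cap, the `input_dir/` and `compacted.parquet` paths, and the sort column `event_time` are placeholder assumptions, not values taken from the gist.

```rust
// Hypothetical sketch of a memory-capped parquet compaction; not the gist's code.
use std::sync::Arc;

use datafusion::config::TableParquetOptions;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::execution::session_state::SessionStateBuilder;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Cap the memory pool so sorts/merges spill to disk instead of growing unbounded.
    // 4 GB is a placeholder, not a recommendation.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(4 * 1024 * 1024 * 1024)))
        .build_arc()?;
    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_runtime_env(runtime)
        .build();
    let ctx = SessionContext::new_with_state(state);

    // Read the N source files in the directory as one logical table.
    let df = ctx
        .read_parquet("input_dir/", ParquetReadOptions::default())
        .await?;

    // Writer options matching the settings mentioned above: 1M-row row groups, Zstd output.
    let mut pq_opts = TableParquetOptions::default();
    pq_opts.global.max_row_group_size = 1_000_000;
    pq_opts.global.compression = Some("zstd(3)".into());

    // Sorting on a (placeholder) column is what drives ExternalSorterMerge and its spills.
    df.sort(vec![col("event_time").sort(true, false)])?
        .write_parquet("compacted.parquet", DataFrameWriteOptions::new(), Some(pq_opts))
        .await?;
    Ok(())
}
```

The bounded `FairSpillPool` is shown because the error quoted above comes from a memory-pool reservation; with the default unbounded pool there is no limit to hit and operators never spill.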
