GitHub user ndchandar created a discussion: Feedback on high memory usage when merging N parquet files
Hello, I am writing a program that takes N parquet files (where N = 40). Each source parquet file is ~6 to ~8 MB in size and Zstd compressed. The files are compacted/combined to produce one larger parquet file (~220 to ~250 MB). It appears that we need as much as **~24 GB** of memory for the compaction to succeed.

This gist https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138 is the interesting bit. It basically lists all files in a directory, takes N files, and compacts them. I tried giving the optimizer hints that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M. With less memory (e.g. 12 or 16 GB), I run into the issue below:

```
Caused by: Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool
```

I am trying to understand why the spill is not happening efficiently (I am relatively new to DataFusion). Looking for any help/hints to reduce the memory utilization.

GitHub link: https://github.com/apache/datafusion/discussions/18833
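For context (the gist itself is not reproduced here), below is a minimal sketch of the kind of compaction described above, written against a recent DataFusion release. The 4 GB memory cap, the `input_dir/` and `compacted.parquet` paths, and the sort column `event_time` are placeholder assumptions, not values taken from the gist.

```rust
// Hypothetical sketch of a memory-capped parquet compaction; not the gist's code.
use std::sync::Arc;

use datafusion::config::TableParquetOptions;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::execution::session_state::SessionStateBuilder;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Cap the memory pool so sorts/merges spill to disk instead of growing unbounded.
    // 4 GB is a placeholder, not a recommendation.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(4 * 1024 * 1024 * 1024)))
        .build_arc()?;
    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_runtime_env(runtime)
        .build();
    let ctx = SessionContext::new_with_state(state);

    // Read the N source files in the directory as one logical table.
    let df = ctx
        .read_parquet("input_dir/", ParquetReadOptions::default())
        .await?;

    // Writer options matching the settings mentioned above: 1M-row row groups, Zstd output.
    let mut pq_opts = TableParquetOptions::default();
    pq_opts.global.max_row_group_size = 1_000_000;
    pq_opts.global.compression = Some("zstd(3)".into());

    // Sorting on a (placeholder) column is what drives ExternalSorterMerge and its spills.
    df.sort(vec![col("event_time").sort(true, false)])?
        .write_parquet("compacted.parquet", DataFrameWriteOptions::new(), Some(pq_opts))
        .await?;
    Ok(())
}
```

The bounded `FairSpillPool` is shown because the error quoted above comes from a memory-pool reservation; with the default unbounded pool there is no limit to hit and operators never spill.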
