Hey folks,

I would like to raise awareness of a change I've been working on in PR #15590 and see if anyone is interested in taking a look.

To start with the problem: when a single snapshot adds a large number of data files, our MergingSnapshotProducer accumulates all of them in memory before the commit. This unbounded collection can lead to memory pressure or OOM failures during large writes. Think of a full-table compaction, or a CTAS on a table with wide columns and stats.

Inspired by the rolling manifest writer, my approach is to automatically flush the accumulated data files to manifest files once the count reaches a threshold, which currently defaults to 100k. This spreads the manifest-writing I/O over the course of the add operation and puts a fixed ceiling on peak memory during commit. Flushed manifests are lazily read back for validation, and the existing mechanisms cover them for commit retries and for cleanup on failure.

For any commit adding fewer than 100k entries in a single snapshot, there is no behavior change. However, the 100k default interacts with the existing MIN_FILE_GROUP_SIZE of 10k, which controls how files are grouped for manifest writing. This effectively limits manifest write parallelism to 10 (100k / 10k), even on hardware with more available cores. I ran an appendFiles benchmark locally, and it showed roughly an 18% latency increase for appending 1M files (12s -> 14.1s) in exchange for the flat memory ceiling.

I'd appreciate thoughts and feedback on whether there's a better way to solve this problem, and on whether we should make this threshold configurable or balance it more adaptively.

Thanks,
Hongyue Zhang

Link to PR: https://github.com/apache/iceberg/pull/15590
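
P.S. For anyone who wants the gist before opening the PR, here is a minimal sketch of the flush-at-threshold idea. It is a simplified illustration only, not the code in the PR: `FlushingBuffer`, `maxWriters`, and the in-memory list of batches (a stand-in for manifest files written to storage) are all made-up names, and the parallelism formula is just my assumption of how the 100k / 10k arithmetic above works out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT Iceberg code: buffers added file paths and
// flushes them in batches so the pending list never exceeds the threshold.
final class FlushingBuffer {
  private final int flushThreshold;
  private final List<String> pending = new ArrayList<>();
  // Stand-in for manifests already written out; a real implementation
  // would keep only manifest metadata in memory, not the entries.
  private final List<List<String>> flushedBatches = new ArrayList<>();

  FlushingBuffer(int flushThreshold) {
    if (flushThreshold <= 0) {
      throw new IllegalArgumentException("flushThreshold must be positive");
    }
    this.flushThreshold = flushThreshold;
  }

  // Adds one file and flushes as soon as the pending count hits the threshold.
  void add(String path) {
    pending.add(path);
    if (pending.size() >= flushThreshold) {
      flush();
    }
  }

  // Moves whatever is pending into a new batch; no-op when nothing is pending.
  void flush() {
    if (pending.isEmpty()) {
      return;
    }
    flushedBatches.add(new ArrayList<>(pending));
    pending.clear();
  }

  int pendingCount() {
    return pending.size();
  }

  int flushedBatchCount() {
    return flushedBatches.size();
  }

  // Assumed arithmetic: each writer needs at least minGroupSize files,
  // so one flush can keep at most threshold / minGroupSize writers busy.
  static int maxWriters(int flushThreshold, int minGroupSize) {
    return Math.max(1, flushThreshold / minGroupSize);
  }
}
```

With a threshold of 3, adding 7 files leaves 2 flushed batches and 1 pending file, and `maxWriters(100_000, 10_000)` gives the parallelism cap of 10 mentioned above.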
