rynewang opened a new pull request, #3304: URL: https://github.com/apache/iceberg-python/pull/3304
## Summary `_FastAppendFiles` currently writes every added data file into a single manifest. `commit.manifest.target-size-bytes` (default 8 MiB) is already defined and honoured by `_MergeAppendFiles` when bin-packing existing manifests, but the fast-append path never reads it — so a large append produces one arbitrarily large manifest. Java Iceberg's `RollingManifestWriter` rolls over at the target on the fast-append path too; this brings pyiceberg to parity. ## Change `_SnapshotProducer._manifests()` now: - Writes the first added-manifest inline until `writer.tell()` reaches `commit.manifest.target-size-bytes`, which yields an exact entries-per-manifest count for this append's entry shape. - Chunks the remaining `_added_data_files` by that count and submits each chunk to the existing `ExecutorFactory` thread pool. Encoding is GIL-bound, but block compression (zlib) and the output-file write both release the GIL, so one chunk's encode overlaps with earlier chunks' compress/upload. - Submits `_write_delete_manifest` and `_existing_manifests` to the pool *before* starting the inline first chunk, so they still run concurrently. Small appends (everything fits under the target) take the same code path and produce one manifest, same as before. ## Why multiple manifests matter downstream - Planners (Trino/Athena/Spark) read manifests in parallel; one multi-GB manifest is a single-threaded decode bottleneck. - The manifest-list carries per-manifest partition bounds, so predicates can skip whole manifests without opening them — which is impossible with one monolith. - Delete/overwrite rewrites any manifest containing an affected file; with one large manifest, a single-row delete rewrites the whole thing. ## Testing - `test_fast_append_rolls_added_manifests_at_target_size`: creates a table with `commit.manifest.target-size-bytes=4096`, appends 200 data files, asserts `_manifests()` returns more than one manifest, all entries accounted for, and each manifest is within a small multiple of the target. - `test_fast_append_single_manifest_when_under_target`: 3 files → one manifest. - Existing `tests/table/test_snapshots.py` and `tests/utils/test_manifest.py` pass unchanged. ## Relationship to #3303 Independent — this PR only touches `snapshot.py`, #3303 only touches `pyiceberg/avro/`. The thread-pool overlap here helps with or without the Cython encoder (pure-Python encode still overlaps with compress/upload); with #3303 the encode floor drops and the overlap ratio improves. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
