rynewang opened a new pull request, #3304:
URL: https://github.com/apache/iceberg-python/pull/3304

   ## Summary
   
   `_FastAppendFiles` currently writes every added data file into a single 
manifest. `commit.manifest.target-size-bytes` (default 8 MiB) is already 
defined and honoured by `_MergeAppendFiles` when bin-packing existing 
manifests, but the fast-append path never reads it — so a large append produces 
one arbitrarily large manifest. Java Iceberg's `RollingManifestWriter` rolls 
over at the target on the fast-append path too; this brings pyiceberg to parity.
   
   ## Change
   
   `_SnapshotProducer._manifests()` now:
   
   - Writes the first added-manifest inline until `writer.tell()` reaches 
`commit.manifest.target-size-bytes`, which yields an exact entries-per-manifest 
count for this append's entry shape.
   - Chunks the remaining `_added_data_files` by that count and submits each 
chunk to the existing `ExecutorFactory` thread pool. Encoding is GIL-bound, but 
block compression (zlib) and the output-file write both release the GIL, so one 
chunk's encode overlaps with earlier chunks' compress/upload.
   - Submits `_write_delete_manifest` and `_existing_manifests` to the pool 
*before* starting the inline first chunk, so they still run concurrently.
   
   Small appends (everything fits under the target) take the same code path and 
produce one manifest, same as before.
   
   ## Why multiple manifests matter downstream
   
   - Planners (Trino/Athena/Spark) read manifests in parallel; one multi-GB 
manifest is a single-threaded decode bottleneck.
   - The manifest-list carries per-manifest partition bounds, so predicates can 
skip whole manifests without opening them — which is impossible with one 
monolith.
   - Delete/overwrite rewrites any manifest containing an affected file; with 
one large manifest, a single-row delete rewrites the whole thing.
   
   ## Testing
   
   - `test_fast_append_rolls_added_manifests_at_target_size`: creates a table 
with `commit.manifest.target-size-bytes=4096`, appends 200 data files, asserts 
`_manifests()` returns more than one manifest, all entries accounted for, and 
each manifest is within a small multiple of the target.
   - `test_fast_append_single_manifest_when_under_target`: 3 files → one 
manifest.
   - Existing `tests/table/test_snapshots.py` and 
`tests/utils/test_manifest.py` pass unchanged.
   
   ## Relationship to #3303
   
   Independent — this PR only touches `snapshot.py`, #3303 only touches 
`pyiceberg/avro/`. The thread-pool overlap here helps with or without the 
Cython encoder (pure-Python encode still overlaps with compress/upload); with 
#3303 the encode floor drops and the overlap ratio improves.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to