lawofcycles commented on PR #3320:
URL: https://github.com/apache/iceberg-python/pull/3320#issuecomment-4370712868

   ## Benchmark results
   
   This PR brings three capabilities to PyIceberg's write path.
   
   1. Transparent retry for concurrent writes. Users no longer need to 
implement retry logic around `table.append()` or `table.delete()`.
   2. Data conflict validation. Incompatible concurrent modifications (e.g. 
concurrent deletes on the same data) are detected and rejected with 
`ValidationException`, preventing silent data corruption.
   3. Efficient retry via data file reuse. On retry, only manifests are 
regenerated. Data files already written to S3 are reused, avoiding redundant 
Parquet writes.
   
   To validate (3), I benchmarked concurrent appends using the NYC Yellow Taxi 
dataset (2024-01, 2.9M rows, 19 columns) with Glue Data Catalog + S3.
   
   ### Before vs After
   
   Without this PR, concurrent appends fail immediately with 
`CommitFailedException`. Only one writer succeeds per batch, regardless of 
parallelism.
   
   | Workers | Before (no retry) | After (this PR) |
   |--------:|-------------------:|----------------:|
   |       2 |              50.0% |          100.0% |
   |       4 |              25.0% |          100.0% |
   |       8 |              12.5% |          100.0% |
   
   (N workers x 10 batches x 1K rows, `commit.retry.num-retries=10`, 
`commit.retry.min-wait-ms=500`)
   
   ### Internal retry vs user-side retry
   
   Compared the internal retry (this PR) against a user-side retry that catches 
`CommitFailedException` and re-does `load_table` + `append` from scratch. Both 
use the same backoff parameters (retries=15, min-wait=500ms).
   
   | Workers | Internal retry | User-side retry | Speedup |
   |--------:|---------------:|----------------:|--------:|
   |       2 |            33s |             46s |    1.4x |
   |       4 |            68s |             87s |    1.3x |
   |       8 |           167s |            299s |    1.8x |
   |      16 |           399s |            588s |    1.5x |
   
   (3 batches per worker, ~370K-1.5M rows per batch depending on worker count)
   
   Internal retry is faster because it reuses data files already written to S3 
and only regenerates manifests on retry. User-side retry rewrites Parquet files 
on every attempt.
   
   Interestingly, internal retry actually performs *more* retries than 
user-side retry (88 vs 50 total retries at 8 workers), because the shorter 
retry window increases commit attempt density. Despite more retries, the total 
time is lower because each retry is much cheaper.
   
   ### Tuning `commit.retry.min-wait-ms`
   
   Tested different `min-wait-ms` values with 8 workers to find the optimal 
backoff for Glue.
   
   | min-wait-ms | Total time | Total retries |
   |------------:|-----------:|--------------:|
   |         100 |       158s |            78 |
   |         500 |       126s |           115 |
   |        1000 |       238s |            67 |
   |        2000 |       235s |            63 |
   |        3000 |       206s |            41 |
   
   The default (100ms, matching Java Iceberg) works reasonably well, but 500ms 
is optimal for Glue. Too short causes contention storms, too long wastes time 
waiting. The optimal value depends on the catalog's commit latency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to