andormarkus opened a new issue, #1751:
URL: https://github.com/apache/iceberg-python/issues/1751
### Feature Request / Improvement
## Problem Statement
A key problem in distributed Iceberg systems is that commit processes can
block each other when multiple workers try to update table metadata
simultaneously. This blocking creates a severe performance bottleneck that
limits throughput, particularly in high-volume ingestion scenarios.
## Use Case
In our distributed architecture:
1. Process A writes Parquet files in Iceberg-compatible format
2. Simple string identifiers (file paths) need to be passed between systems
3. Process B takes these strings and commits the files to make them visible
in queries
This pattern is especially useful in high-concurrency ingestion, where many
writers append to the same Iceberg table simultaneously. Centralizing and
coordinating the commit step means distributed writers no longer block each
other on metadata updates, which is otherwise the dominant bottleneck.
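Concretely, the contention shows up as failed optimistic commits that each writer must retry. A minimal, self-contained sketch of that retry loop (the `CommitConflict` exception and `flaky_commit` function are stand-ins for illustration; in PyIceberg the analogous error raised on a lost metadata race is `CommitFailedException`):

```python
import random
import time


class CommitConflict(Exception):
    """Stand-in for the error a catalog raises when the metadata version moved."""


def commit_with_retries(do_commit, max_attempts=5):
    """Optimistic-concurrency loop: retry the commit on conflict with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return do_commit()
        except CommitConflict:
            if attempt == max_attempts - 1:
                raise
            # Jittered exponential backoff before re-reading metadata and retrying.
            time.sleep(random.uniform(0, 0.001 * 2 ** attempt))


# Demo: a commit that loses the race twice before succeeding.
attempts = {"n": 0}

def flaky_commit():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise CommitConflict("metadata version moved")
    return "committed"

result = commit_with_retries(flaky_commit)
```

With many concurrent writers, most attempts land in the retry branch, which is exactly the overhead that centralizing commits removes.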
### Detailed Workflow
Our workflow involves:
```python
import uuid

from pyiceberg.io.pyarrow import _dataframe_to_data_files  # private API

# Assumes `catalog`, a pyarrow.Table `pa_df`, and a `queue` client are set up.

# Process A: write data files but don't commit
table = catalog.load_table(identifier="iceberg.table")
data_files = list(
    _dataframe_to_data_files(
        table_metadata=table.metadata,
        write_uuid=uuid.uuid4(),
        df=pa_df,
        io=table.io,
    )
)
queue.send(data_files)  # send the DataFile objects through the queue system

# Process B: commit processor (runs separately)
data_files = queue.receive()
with table.transaction() as trx:
    with trx.update_snapshot().fast_append() as update_snapshot:
        for data_file in data_files:
            update_snapshot.append_data_file(data_file)
```
This separation of write and commit operations provides several advantages:
- Improved throughput by parallelizing write operations across multiple
workers
- Reduced lock contention since metadata commits (which require locks) are
centralized
- Better failure handling - failed writes don't impact the table state
- Controlled transaction timing - commits can be batched or scheduled
optimally
- Elimination of commit blocking - with commits centralized, distributed
writers never block each other during metadata updates
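The "batched or scheduled" commits above can be sketched as a small coordinator that drains the queue and issues one transaction per batch. This is an illustrative stand-in (plain `queue.Queue`, and a `commit_batch` callback that in the real system would open `table.transaction()` and `fast_append` the files):

```python
from queue import Empty, Queue


def drain_batch(q, max_batch, timeout=0.01):
    """Pull up to max_batch items off the queue; return whatever arrived in time."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get(timeout=timeout))
        except Empty:
            break
    return batch


def run_committer(q, commit_batch, max_batch=100):
    """Drain the queue in batches, issuing one metadata commit per batch."""
    total = 0
    while True:
        batch = drain_batch(q, max_batch)
        if not batch:
            break
        commit_batch(batch)  # one transaction -> one snapshot per batch
        total += len(batch)
    return total


# Demo with a stand-in commit that just records the batches it receives.
q = Queue()
for i in range(250):
    q.put(f"s3://bucket/data/file-{i}.parquet")
commits = []
committed = run_committer(q, commits.append, max_batch=100)
```

Batching this way turns N per-file commits into roughly N / max_batch snapshots, which is the throughput win we are after.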
## Current Limitations
- Serializing `DataFile` objects between processes is challenging
- We've attempted custom serialization with compression (gzip, zlib); it
works, but requires long, complex code
- Using `jsonpickle` also presented significant problems
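For reference, the core of the custom serialization we tried looks like the following. This is a minimal sketch that assumes the queued payload pickles cleanly (part of why the real code grew complicated); a plain dict stands in for a `DataFile` here:

```python
import gzip
import pickle


def dumps_compressed(obj) -> bytes:
    """Pickle an object and gzip the bytes for transport over the queue."""
    return gzip.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def loads_compressed(payload: bytes):
    """Inverse of dumps_compressed: decompress and unpickle."""
    return pickle.loads(gzip.decompress(payload))


# Round-trip demo on a stand-in record (a real DataFile carries far more state:
# partition values, column stats, split offsets, ...).
record = {"file_path": "s3://bucket/data/file-0.parquet", "record_count": 1000}
payload = dumps_compressed(record)
restored = loads_compressed(payload)
```

A first-class `DataFile` serialization API would let us delete this layer entirely.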
## Proposed Solution
We're seeking a robust way to handle distributed writes, potentially via:
1. Adding serialization/deserialization methods to the `DataFile` class
2. Supporting Avro for efficient serialization of `DataFile` objects
(potentially more compact than other approaches)
3. Better integration with the `append_data_file` API
4. OR a more accessible way to use the `ManifestFile` functionality that's
already implemented in PyIceberg
Ideally, the solution would:
- Handle schema evolution gracefully (unlike the current `add_files`
approach, which has issues when the schema changes)
- Work efficiently with minimal overhead for large-scale concurrent
processing
- Provide simple primitives that can be used in distributed systems without
requiring complex serialization
- Follow patterns similar to those used in the Java implementation where
appropriate
## Alternative Approaches Tried
- We've implemented a custom serialization/deserialization function with
compression
- We explored the approach in #1678, but found it created too many commits
and became a performance bottleneck
## Related PRs/Issues
- PR #1742 (closed): Original attempt at write_parquet API
- Issue #1737: Feature request for Table-Compatible Parquet Files
- Issue #1678: Related implementation suggestion
We're looking for guidance on the best approach to solve this distributed
writing pattern while maintaining performance and schema compatibility.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]