[I] [Feature Request] Add Writer Support for Table-Compatible Parquet Files [iceberg-python]

via GitHub Thu, 27 Feb 2025 06:42:03 -0800


andormarkus-alcd opened a new issue, #1737:
URL: https://github.com/apache/iceberg-python/issues/1737


   ### Feature Request / Improvement
   
   ## Problem Statement
   
    **I'm happy to submit a PR to implement this feature.**
   
   PyIceberg currently provides functionality to add existing Parquet files to 
an Iceberg table using `add_files()`, which is useful when files already exist 
in a compatible format. However, the library lacks a convenient way to write 
new Parquet files that are automatically compatible with the Iceberg table 
format, specifically:
   
   1. There's no straightforward API to write Parquet files that match an 
Iceberg table's schema, partitioning, and other metadata requirements
   2. Users currently need to implement complex logic to ensure schema 
alignment, partition compatibility, etc.
   3. This creates an unnecessary barrier for users wanting to write files that 
can later be added to Iceberg tables without rewriting
   
   
   ## Use Case: High-Throughput Ingestion with AWS Lambda
   We are currently using AWS Lambda functions to write to Iceberg tables. When 
ingesting large volumes of files concurrently, we run into Lambda concurrency 
limits because:
   - The Parquet writing process is the most time-consuming part of the 
operation
   - The commit phase is relatively fast
   
   By separating these operations (writing compatible Parquet files 
independently, then committing via `add_files()` through a queue), we could 
significantly increase our throughput. This would allow us to:
   
   - Use Lambda functions to write compatible Parquet files in parallel
   - Queue the much faster commit operations separately
   - Use a second Lambda function to process these queued operations in bulk 
(e.g., every minute), committing multiple files at once rather than one by one
   - Avoid concurrency limits that are currently bottlenecking our ingestion 
pipeline
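
A minimal sketch of the second (committer) Lambda described above, assuming the queue delivers one Parquet file path per message. `commit_batch` and `drain_queue` are hypothetical helper names, not part of PyIceberg; the only real call relied on is `Table.add_files(file_paths=...)`, and the table is duck-typed so the batching logic can be exercised without a catalog.

```python
from typing import Iterable, List, Protocol


class SupportsAddFiles(Protocol):
    """Anything exposing an add_files(file_paths=...) method, e.g. a PyIceberg Table."""

    def add_files(self, file_paths: List[str]) -> None: ...


def drain_queue(messages: Iterable[str]) -> List[str]:
    """Collect queued file paths, dropping duplicates while keeping arrival order."""
    return list(dict.fromkeys(messages))


def commit_batch(table: SupportsAddFiles, messages: Iterable[str]) -> int:
    """Commit every queued file in a single add_files() call.

    Returns the number of files committed; 0 means nothing was queued
    and no commit was attempted.
    """
    file_paths = drain_queue(messages)
    if file_paths:
        # One call, and therefore one commit, for the whole batch.
        table.add_files(file_paths=file_paths)
    return len(file_paths)
```

With a real table this would be called as `commit_batch(tbl, paths_from_queue)` on each scheduled run, so a burst of writer invocations ends up as a single commit instead of one commit per file.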
   
   ## Proposed Solution
   Add a new API to PyIceberg that allows writing table-compatible Parquet 
files. This could look something like:
   ```python
   # Possible API design
   writer = tbl.parquet_writer()
   # No destination_path needed because the table already has location info
   writer.write_dataframe(df)
   
   # Or alternatively: write to the table's data location with appropriate naming
   tbl.write_parquet(df)
   
   # These files could then be added without rewriting;
   # add_files() could discover compatible files in the table's data location
   tbl.add_files()
   ```
   
   ## The implementation would:
   1. Automatically handle schema alignment
   2. Apply correct partition transforms
   3. Add appropriate metadata to ensure compatibility with `add_files()`
   4. Set up the table's name mapping (`schema.name-mapping.default`) so that 
columns in the written files resolve by name
   5. Generate files without field IDs in the Parquet metadata (as required by 
`add_files()`)
   6. Use the table's location information to determine write paths 
automatically
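
As a rough illustration of points 2 and 6, the sketch below derives a day partition value and builds a write path under the table location. Both helpers are hypothetical. The Hive-style `<location>/data/<field>=<value>/<file>.parquet` layout is an assumption: the real layout is decided by the table's location provider and may differ (for example when object-storage paths are enabled), and partition values are not URL-encoded here.

```python
import uuid
from datetime import date, datetime, timezone
from typing import Mapping, Optional


def day_transform(ts: datetime) -> date:
    """Day partition value for a timestamp.

    Zone-aware timestamps are converted to UTC first; naive timestamps
    are taken as-is (an assumption of this sketch).
    """
    if ts.tzinfo is not None:
        ts = ts.astimezone(timezone.utc)
    return ts.date()


def data_file_path(
    table_location: str,
    partition: Mapping[str, object],
    file_id: Optional[str] = None,
) -> str:
    """Build '<location>/data/<field>=<value>/.../<file_id>.parquet'.

    A random UUID is used as the file name when file_id is not given.
    """
    parts = [f"{name}={value}" for name, value in partition.items()]
    file_name = f"{file_id or uuid.uuid4()}.parquet"
    return "/".join([table_location.rstrip("/"), "data", *parts, file_name])
```

A writer could call `data_file_path(tbl.location(), {"ts_day": day_transform(ts)})` to pick a destination before writing, where `ts_day` stands in for whatever the partition field is actually named.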
   
   ## Benefits
   This would create a complete workflow for efficiently managing Iceberg 
tables:
    - Write compatible files
    - Add them to tables without rewriting
    - Perform normal maintenance operations
   
   
   This new feature would simplify creating files that meet these requirements.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

