Re: [I] PyIceberg Production Use case survey [iceberg-python]

via GitHub Tue, 27 Jan 2026 01:32:43 -0800


EnyMan commented on issue #1202:
URL: 
https://github.com/apache/iceberg-python/issues/1202#issuecomment-3804099682


   We use PyIceberg to store features for our ML models. We write them daily and 
keep the full history. We also have a special requirement: we need to be able to 
backfill data as if it had been stored on the original day. We can compute the 
features that way, but we also need the query interface to be stable, so that 
after a failed job we can rerun it as if it had run on that date (after a failed 
job, we stop all future jobs so we don't break time continuity). We are 
upsert-heavy, with tables ranging from 1 to 100 columns and millions or tens of 
millions of rows. In some tables only a few rows change; in others, potentially 
all rows can change. So upsert speed is a must for us (see 
https://github.com/apache/iceberg-python/pull/2943, where I attempt to optimize 
upsert for these kinds of workloads).
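
For illustration only, the workload described above can be sketched in plain
Python. This is not PyIceberg's API and not the approach taken in the linked
PR; the function name `upsert`, the `join_cols` parameter, and the column names
`entity_id`, `as_of`, and `score` are all hypothetical. The sketch shows
key-based upsert semantics where unchanged rows are skipped, which is why the
share of changed rows matters for speed.

```python
from datetime import date


def upsert(existing, incoming, join_cols):
    """Merge incoming rows into existing rows, matching on join_cols.

    Returns (merged_rows, n_updated, n_inserted). Incoming rows that are
    identical to the stored row are skipped and cause no write.
    """
    def key_of(row):
        return tuple(row[col] for col in join_cols)

    merged = {key_of(row): row for row in existing}
    updated = inserted = 0
    for row in incoming:
        current = merged.get(key_of(row))
        if current is None:
            inserted += 1          # key not seen before: insert
        elif current != row:
            updated += 1           # key exists, values differ: update
        else:
            continue               # unchanged row: nothing to write
        merged[key_of(row)] = row
    return list(merged.values()), updated, inserted


# Hypothetical feature rows keyed by entity and the date they apply to.
stored = [
    {"entity_id": 1, "as_of": date(2026, 1, 25), "score": 0.4},
    {"entity_id": 2, "as_of": date(2026, 1, 25), "score": 0.7},
]
batch = [
    {"entity_id": 1, "as_of": date(2026, 1, 25), "score": 0.4},  # unchanged
    {"entity_id": 2, "as_of": date(2026, 1, 25), "score": 0.9},  # changed
    {"entity_id": 1, "as_of": date(2026, 1, 26), "score": 0.5},  # new
]
rows, updated, inserted = upsert(stored, batch, ["entity_id", "as_of"])
```

Keying on the as-of date as well as the entity means a rerun for a past date
touches only that date's rows, which matches the backfill requirement above.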


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

