EnyMan commented on issue #1202: URL: https://github.com/apache/iceberg-python/issues/1202#issuecomment-3804099682
We use PyIceberg to store features for our ML models. We write them daily and keep the full history. We also have a special requirement: we need to be able to backfill data as if it had been stored on the original day. We can compute the features that way, but we also need the query interface to be stable, so that if a job fails we can rerun it as if it had run on that date (after a failed job we stop all later jobs, so we don't break time continuity). We are upsert-heavy, with tables ranging from 1 to 100 columns and millions or tens of millions of rows. In some tables only a few rows change; in others, potentially all rows can change. So upsert speed is a must for us (see https://github.com/apache/iceberg-python/pull/2943, where I attempt to optimize upsert for these types of upserts).
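
To make the access pattern concrete, here is a rough sketch in plain Python. It does not use PyIceberg and is not how PyIceberg or the linked PR implements upsert; the in-memory `history` dict and the helper names `read_as_of` and `upsert_as_of` are hypothetical stand-ins. It only illustrates two properties mentioned above: a write can be made "as of" a given date, and rows whose values did not change are skipped, so rerunning a failed job for the same date is idempotent.

```python
from datetime import date

# Hypothetical in-memory stand-in for a feature table:
# history maps a row key to a list of (as_of_date, features), sorted by date.

def read_as_of(history, as_of):
    """Return the feature values visible on `as_of` (latest version on or before it)."""
    state = {}
    for key, versions in history.items():
        visible = [feats for d, feats in versions if d <= as_of]
        if visible:
            state[key] = visible[-1]
    return state

def upsert_as_of(history, rows, as_of):
    """Write `rows` as if the job ran on `as_of`; unchanged rows are skipped.

    Returns (rows_inserted, rows_updated).
    """
    current = read_as_of(history, as_of)
    inserted = updated = 0
    for key, feats in rows.items():
        if key not in current:
            inserted += 1
        elif current[key] != feats:
            updated += 1
        else:
            continue  # identical to what is already visible: nothing to write
        versions = history.setdefault(key, [])
        # Drop any version already written for this date (a rerun), then re-sort.
        versions[:] = [(d, f) for d, f in versions if d != as_of]
        versions.append((as_of, dict(feats)))
        versions.sort(key=lambda v: v[0])
    return inserted, updated

history = {}
day1, day2 = date(2024, 1, 1), date(2024, 1, 2)
first = upsert_as_of(history, {"a": {"x": 1}, "b": {"x": 2}}, day1)
second = upsert_as_of(history, {"a": {"x": 1}, "b": {"x": 3}}, day2)
rerun = upsert_as_of(history, {"a": {"x": 1}, "b": {"x": 3}}, day2)
```

In this toy version only row `b` gets a new version on the second day, and repeating the second-day write changes nothing. The real cost in a table format comes from finding the changed rows efficiently, which is the part the sketch glosses over.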
