Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing it to a final table. We call this WAP, for write, audit, publish.
We’ve added support for this in our Iceberg branch. A WAP write creates a new table snapshot, but doesn’t make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table’s current snapshot when the audits succeed.

I wasn’t sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table’s current state. That happens when a Spark property, spark.wap.id, indicates the job is a WAP job. Any table that has WAP enabled via the table property write.wap.enabled=true will then stage the new snapshot instead of fully committing, with the WAP ID recorded in the snapshot’s metadata. (There is a rough sketch of what this looks like from a job’s perspective at the end of this mail.)

Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded but not actually change the table, which is why we didn’t submit it before now.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix
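Here is the sketch mentioned above, in case it helps to see the flow concretely. It is illustrative only: the table name, WAP ID, and exact write call are made up, and it assumes spark is an active SparkSession. The real pieces are just the spark.wap.id session property and the write.wap.enabled table property described earlier.

  // Rough sketch only; assumes the spark.wap.id / write.wap.enabled
  // properties behave as described above. Names are made up.

  // 1. Tag this Spark job as a WAP job with an ID chosen by the orchestrator.
  spark.conf.set("spark.wap.id", "nightly_etl_run_1")

  // 2. One-time setup: opt the table in to WAP behavior. Depending on the
  //    Spark version, this could also be done through Iceberg's table API.
  spark.sql("ALTER TABLE db.events SET TBLPROPERTIES ('write.wap.enabled'='true')")

  // 3. Write as usual. With both properties set, Iceberg stages a new snapshot
  //    tagged with the WAP ID instead of making it the table's current snapshot.
  //    (The exact write call depends on the Spark version in use.)
  spark.table("db.events_staging")
    .write
    .format("iceberg")
    .mode("append")
    .save("db.events")

  // 4. A separate audit process checks the staged snapshot and, if the audits
  //    pass, publishes it by making it the table's current snapshot.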
