Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing it to a final table. We call this WAP, for write, audit, publish.
We’ve added support for this in our Iceberg branch. A WAP write creates a new table snapshot, but doesn’t make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table’s current snapshot when the audits succeed.

I wasn’t sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table’s current state. That happens when a Spark property, spark.wap.id, indicates the job is a WAP job. Any table that has WAP enabled via the table property write.wap.enabled=true will then stage the new snapshot instead of fully committing, with the WAP ID recorded in the snapshot’s metadata. (There is a rough sketch of what this looks like from a job’s perspective at the end of this mail.)

Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded but not actually change the table, which is why we didn’t submit it before now.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix
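Here is the sketch mentioned above, in case it helps to see the flow concretely. It is illustrative only: the table name, WAP ID, and exact write call are made up, and it assumes spark is an active SparkSession. The real pieces are just the spark.wap.id session property and the write.wap.enabled table property described earlier.

  // Rough sketch only; assumes the spark.wap.id / write.wap.enabled
  // properties behave as described above. Names are made up.

  // 1. Tag this Spark job as a WAP job with an ID chosen by the orchestrator.
  spark.conf.set("spark.wap.id", "nightly_etl_run_1")

  // 2. One-time setup: opt the table in to WAP behavior. Depending on the
  //    Spark version, this could also be done through Iceberg's table API.
  spark.sql("ALTER TABLE db.events SET TBLPROPERTIES ('write.wap.enabled'='true')")

  // 3. Write as usual. With both properties set, Iceberg stages a new snapshot
  //    tagged with the WAP ID instead of making it the table's current snapshot.
  //    (The exact write call depends on the Spark version in use.)
  spark.table("db.events_staging")
    .write
    .format("iceberg")
    .mode("append")
    .save("db.events")

  // 4. A separate audit process checks the staged snapshot and, if the audits
  //    pass, publishes it by making it the table's current snapshot.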
