I would also support adding this to Iceberg itself. I think we have a use case 
where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton

> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
> 
> I think this could be useful. When we ingest data from Kafka, we do a 
> predefined set of checks on the data. We can potentially utilize something 
> like this to check for sanity before publishing.  
> 
> How is the auditing process suppose to find the new snapshot , since it is 
> not accessible from the table. Is it by convention?
> 
> -R 
> 
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Hi everyone,
> 
> At Netflix, we have a pattern for building ETL jobs where we write data, then 
> audit the result before publishing the data that was written to a final 
> table. We call this WAP for write, audit, publish.
> 
> We’ve added support in our Iceberg branch. A WAP write creates a new table 
> snapshot, but doesn’t make that snapshot the current version of the table. 
> Instead, a separate process audits the new snapshot and updates the table’s 
> current snapshot when the audits succeed. I wasn’t sure that this would be 
> useful anywhere else until we talked to another company this week that is 
> interested in the same thing. So I wanted to check whether this is a good 
> feature to include in Iceberg itself.
> 
> This works by staging a snapshot. Basically, Spark writes data as expected, 
> but Iceberg detects that it should not update the table’s current stage. That 
> happens when there is a Spark property, spark.wap.id <http://spark.wap.id/>, 
> that indicates the job is a WAP job. Then any table that has WAP enabled by 
> the table property write.wap.enabled=true will stage the new snapshot instead 
> of fully committing, with the WAP ID in the snapshot’s metadata.
> 
> Is this something we should open a PR to add to Iceberg? It seems a little 
> strange to make it appear that a commit has succeeded, but not actually 
> change a table, which is why we didn’t submit it before now.
> 
> Thanks,
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

Reply via email to