This would be super helpful. We have a similar workflow where we run some validation before letting downstream consumers pick up the changes.
Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:

> This definitely sounds interesting. Quick question: does this have any
> impact on the current Upserts spec? Or are we looking to tie this support
> to append-only commits?
>
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Audits run on the snapshot by setting the snapshot-id read option to
>> read the WAP snapshot, even though it is not (yet) the current table
>> state. This is documented in the time travel
>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>> site.
>>
>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>> to table metadata, but does not make it the current table state. That is
>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>> in the staged snapshot’s metadata so processes can find it.
>>
>> I'll add a PR with this code, since there is interest.
>>
>> rb
>>
>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com>
>> wrote:
>>
>>> I would also support adding this to Iceberg itself. I think we have a
>>> use case where we can leverage this.
>>>
>>> @Ryan, could you also provide more info on the audit process?
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>
>>> I think this could be useful. When we ingest data from Kafka, we run a
>>> predefined set of checks on the data. We could potentially use something
>>> like this as a sanity check before publishing.
>>>
>>> How is the auditing process supposed to find the new snapshot, since it
>>> is not accessible from the table? Is it by convention?
>>>
>>> -R
>>>
>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>> data, then audit the result before publishing the data that was written
>>>> to a final table. We call this WAP, for write, audit, publish.
>>>>
>>>> We’ve added support for this in our Iceberg branch. A WAP write creates
>>>> a new table snapshot, but doesn’t make that snapshot the current version
>>>> of the table. Instead, a separate process audits the new snapshot and
>>>> updates the table’s current snapshot when the audits succeed. I wasn’t
>>>> sure that this would be useful anywhere else until we talked to another
>>>> company this week that is interested in the same thing, so I wanted to
>>>> check whether this is a good feature to include in Iceberg itself.
>>>>
>>>> This works by staging a snapshot. Spark writes data as expected, but
>>>> Iceberg detects that it should not update the table’s current state.
>>>> That happens when a Spark property, spark.wap.id, indicates the job is
>>>> a WAP job. Any table that has WAP enabled by the table property
>>>> write.wap.enabled=true will then stage the new snapshot instead of
>>>> fully committing, with the WAP ID in the snapshot’s metadata.
>>>>
>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>> little strange to make it appear that a commit has succeeded without
>>>> actually changing the table, which is why we didn’t submit it before
>>>> now.
>>>>
>>>> Thanks,
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Filip Bocse
>
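
To make the flow described in the thread concrete, here is a minimal
write-side sketch in Spark/Scala. It assumes a SparkSession named spark,
a DataFrame named df, and an already-loaded Iceberg Table named table;
the table location and WAP ID value are placeholders, and only the
write.wap.enabled and spark.wap.id properties come from the thread itself.

    // Write side: stage a snapshot instead of publishing it.
    // Assumes SparkSession `spark`, DataFrame `df`, Iceberg Table `table`.

    // One-time setup: opt the table into WAP behavior.
    table.updateProperties()
      .set("write.wap.enabled", "true")
      .commit()

    // Tag this job as a WAP write; the ID is any string the audit can
    // look up later.
    spark.conf.set("spark.wap.id", "nightly-etl-2019-07-22")

    // The write itself is unchanged; with the property set, Iceberg stages
    // the snapshot rather than making it the current table state.
    df.write
      .format("iceberg")
      .mode("append")
      .save("hdfs://nn:8020/warehouse/db/table")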
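
The audit step then locates the staged snapshot and reads it with the
snapshot-id time-travel option Ryan mentions. This is a sketch that assumes
the WAP ID is stored under a wap.id key in the snapshot summary (the thread
only says the ID is embedded in the snapshot's metadata, so the key name is
a guess) and that the table can be loaded via HadoopTables:

    import scala.collection.JavaConverters._
    import org.apache.iceberg.hadoop.HadoopTables

    // Load the table directly so we can see snapshots that are not current.
    val table = new HadoopTables(spark.sparkContext.hadoopConfiguration)
      .load("hdfs://nn:8020/warehouse/db/table")  // illustrative location

    // Find the staged snapshot carrying our WAP ID. The "wap.id" summary
    // key is an assumed convention, not confirmed by the thread.
    val staged = table.snapshots().asScala
      .find(_.summary().asScala.get("wap.id")
        .contains("nightly-etl-2019-07-22"))
      .getOrElse(sys.error("no staged snapshot found for this WAP ID"))

    // Audit via time travel: read the snapshot even though it is not the
    // current table state.
    val audited = spark.read
      .format("iceberg")
      .option("snapshot-id", staged.snapshotId())
      .load("hdfs://nn:8020/warehouse/db/table")

    // Example check: fail the audit if any rows are missing the key column.
    val badRows = audited.filter("id IS NULL").count()
    require(badRows == 0L, s"audit failed: $badRows rows with null id")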
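
Finally, a publish step would promote the audited snapshot to the current
table state. The thread does not name the API for this part; the sketch
below assumes something along the lines of Iceberg's snapshot-management
interface, so treat it as one plausible shape rather than the mechanism the
PR actually adds:

    // Publish: make the audited snapshot the current table state.
    // setCurrentSnapshot on manageSnapshots() is assumed for illustration;
    // the actual publish mechanism is not specified in the thread.
    table.manageSnapshots()
      .setCurrentSnapshot(staged.snapshotId())
      .commit()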