This definitely sounds interesting. Quick question: does this have any impact on the current Upserts spec, or are we looking to associate this support only with append-only commits?
On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:

> Audits run on the snapshot by setting the snapshot-id read option to read
> the WAP snapshot, even though it has not (yet) become the current table
> state. This is documented in the time travel
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
> site.
>
> We added a stageOnly method to SnapshotProducer that adds the snapshot to
> table metadata, but does not make it the current table state. That is
> called by the Spark writer when there is a WAP ID, and that ID is embedded
> in the staged snapshot’s metadata so processes can find it.
>
> I'll add a PR with this code, since there is interest.
>
> rb
>
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
>
>> I would also support adding this to Iceberg itself. I think we have a use
>> case where we can leverage this.
>>
>> @Ryan, could you also provide more info on the audit process?
>>
>> Thanks,
>> Anton
>>
>> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>>
>> I think this could be useful. When we ingest data from Kafka, we run a
>> predefined set of checks on the data. We could potentially use something
>> like this to check for sanity before publishing.
>>
>> How is the auditing process supposed to find the new snapshot, since it
>> is not accessible from the table? Is it by convention?
>>
>> -R
>>
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> At Netflix, we have a pattern for building ETL jobs where we write data,
>>> then audit the result before publishing the data that was written to a
>>> final table. We call this WAP: write, audit, publish.
>>>
>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>> table snapshot, but doesn’t make that snapshot the current version of the
>>> table. Instead, a separate process audits the new snapshot and updates the
>>> table’s current snapshot when the audits succeed. I wasn’t sure this
>>> would be useful anywhere else until we talked to another company this week
>>> that is interested in the same thing. So I wanted to check whether this is
>>> a good feature to include in Iceberg itself.
>>>
>>> This works by staging a snapshot. Basically, Spark writes data as
>>> expected, but Iceberg detects that it should not update the table’s
>>> current state. That happens when there is a Spark property, spark.wap.id,
>>> that indicates the job is a WAP job. Then any table that has WAP enabled
>>> by the table property write.wap.enabled=true will stage the new snapshot
>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>
>>> Is this something we should open a PR to add to Iceberg? It seems a
>>> little strange to make it appear that a commit has succeeded but not
>>> actually change the table, which is why we didn’t submit it before now.
>>>
>>> Thanks,
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Filip Bocse
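[Editor's note: for readers skimming the thread, the two knobs Ryan names translate to settings roughly like the following. This is a hedged sketch of the configuration surface only; exact syntax and placement depend on the Spark and Iceberg versions in use.]

```
# Spark session property marking the job as a WAP job
spark.wap.id=<some-job-identifier>

# Iceberg table property that enables staging instead of committing
write.wap.enabled=true
```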
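[Editor's note: the stage/audit/publish flow described in the thread can be sketched with a small toy model. This is not Iceberg's real API; the names `Table`, `Snapshot`, `stage_only`, `find_staged`, and `publish` are all illustrative, standing in for `SnapshotProducer.stageOnly` and the WAP-ID lookup Ryan describes.]

```python
# Toy model (NOT the Iceberg API) of write-audit-publish staging:
# a staged snapshot is added to table metadata, tagged with a WAP ID,
# without becoming the current table state; a later publish step
# promotes it once audits pass.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    summary: Dict[str, str] = field(default_factory=dict)  # holds the WAP ID


@dataclass
class Table:
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def stage_only(self, snapshot: Snapshot, wap_id: str) -> None:
        # Analogous to stageOnly: record the snapshot in table metadata,
        # embed the WAP ID, but do NOT advance the current table state.
        snapshot.summary["wap.id"] = wap_id
        self.snapshots.append(snapshot)

    def find_staged(self, wap_id: str) -> Optional[Snapshot]:
        # The audit process locates the staged snapshot by its WAP ID.
        for s in self.snapshots:
            if s.summary.get("wap.id") == wap_id:
                return s
        return None

    def publish(self, wap_id: str) -> None:
        # After audits succeed, make the staged snapshot current.
        staged = self.find_staged(wap_id)
        if staged is None:
            raise ValueError(f"no staged snapshot for WAP ID {wap_id!r}")
        self.current_snapshot_id = staged.snapshot_id


table = Table()
table.stage_only(Snapshot(snapshot_id=1), wap_id="job-123")
assert table.current_snapshot_id is None  # staged, not yet current
table.publish("job-123")                  # audits passed
assert table.current_snapshot_id == 1
```

An audit job would read the staged snapshot directly (in the real system, via the `snapshot-id` read option mentioned in the thread) before deciding whether to publish.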
