Hi Ryan,

Could you please point me to the doc describing how to publish a WAP snapshot? I am able to filter for the snapshot based on the wap.id in the snapshot's summary, but I can't find the official recommendation on committing that snapshot. I can think of cherry-picking the appended/deleted files myself, but I don't know whether I'd be missing something important with that approach.
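[Editor's note: a minimal sketch of the publish step being asked about here, assuming Spark SQL with Iceberg's stored-procedure extensions available. The catalog/table names and WAP ID are placeholders, not from the thread; the helpers only assemble SQL text, which a real job would pass to spark.sql(...).]

```python
# Hedged sketch: locating a staged WAP snapshot and publishing it by
# cherry-picking, rather than cherry-picking individual data files.
# `db.tbl`, `cat`, and the WAP ID are hypothetical placeholders.

def find_staged_snapshot_sql(table, wap_id):
    # The snapshots metadata table exposes each snapshot's summary map;
    # the Spark writer records the wap.id there for staged snapshots.
    return (
        f"SELECT snapshot_id FROM {table}.snapshots "
        f"WHERE summary['wap.id'] = '{wap_id}'"
    )

def publish_snapshot_sql(catalog, table, snapshot_id):
    # cherrypick_snapshot makes the staged snapshot's changes the table's
    # new current state, completing the "publish" step of WAP.
    return f"CALL {catalog}.system.cherrypick_snapshot('{table}', {snapshot_id})"
```

In a live session this would look like `spark.sql(publish_snapshot_sql("cat", "db.tbl", 12345))`. Cherry-picking at the snapshot level (also available through the Java API as `table.manageSnapshots().cherrypick(id).commit()`) avoids hand-tracking appended/deleted files, since Iceberg applies the staged changes as a single atomic commit.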
Thanks,
-Ashish

> ---------- Forwarded message ---------
> From: Ryan Blue <[email protected]>
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <[email protected]>
> Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>
>
> Hi everyone, I've added PR #342
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
> repository with our WAP changes. Please have a look if you're interested
> in this.
>
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:
>
>> I think this use case is pretty helpful in most data environments; we do
>> the same sort of stage-check-publish pattern to run quality checks.
>> One question: if, say, the audit part fails, is there a way to expire
>> the snapshot, or what would the workflow that follows look like?
>>
>> Best,
>> Edgar
>>
>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:
>>
>>> This would be super helpful. We have a similar workflow where we do some
>>> validation before letting the downstream consume the changes.
>>>
>>> Best,
>>> Mouli
>>>
>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:
>>>
>>>> This definitely sounds interesting. A quick question: does this have
>>>> any impact on the current upserts spec? Or are we looking to associate
>>>> this support with append-only commits?
>>>>
>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>> read the WAP snapshot, even though it has not (yet) become the current
>>>>> table state. This is documented in the time travel
>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>>> site.
>>>>>
>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>> embedded in the staged snapshot's metadata so processes can find it.
>>>>>
>>>>> I'll add a PR with this code, since there is interest.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>
>>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>>> use case where we can leverage this.
>>>>>>
>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>>>>>>
>>>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>>>> predefined set of checks on the data. We can potentially utilize
>>>>>> something like this to check for sanity before publishing.
>>>>>>
>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>
>>>>>> -R
>>>>>>
>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>> data, then audit the result before publishing the data that was written
>>>>>>> to a final table. We call this WAP, for write, audit, publish.
>>>>>>>
>>>>>>> We've added support in our Iceberg branch. A WAP write creates a new
>>>>>>> table snapshot, but doesn't make that snapshot the current version of
>>>>>>> the table. Instead, a separate process audits the new snapshot and
>>>>>>> updates the table's current snapshot when the audits succeed. I wasn't
>>>>>>> sure that this would be useful anywhere else until we talked to another
>>>>>>> company this week that is interested in the same thing. So I wanted to
>>>>>>> check whether this is a good feature to include in Iceberg itself.
>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>> expected, but Iceberg detects that it should not update the table's
>>>>>>> current state. That happens when there is a Spark property, spark.wap.id,
>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled
>>>>>>> by the table property write.wap.enabled=true will stage the new
>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot's
>>>>>>> metadata.
>>>>>>>
>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>>> actually change a table, which is why we didn't submit it before now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>
>>>> --
>>>> Filip Bocse
>>
>> --
>> Edgar Rodriguez
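[Editor's note: pulling the thread together, the write-and-audit half of the workflow Ryan describes can be sketched as a sequence of Spark SQL / session-config statements. This is an illustrative sketch, not from the thread; the table name, WAP ID, and source are placeholders, and the helper only assembles the statement strings that a real job would run with spark.sql(...).]

```python
# Hedged sketch of the write-and-audit half of WAP as Spark SQL /
# session-config statements. All names (db.tbl, staging_source, the
# WAP ID) are hypothetical placeholders.

def wap_write_and_audit_steps(table, wap_id):
    return [
        # 1. Opt the table in to WAP so writes are staged, not published.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('write.wap.enabled'='true')",
        # 2. Mark the session as a WAP job; the writer records this ID
        #    in the staged snapshot's summary.
        f"SET spark.wap.id = {wap_id}",
        # 3. Write as usual; with the two settings above, the snapshot is
        #    added to table metadata but not made current.
        f"INSERT INTO {table} SELECT * FROM staging_source",
        # 4. Locate the staged snapshot for auditing via the snapshots
        #    metadata table and its summary map.
        f"SELECT snapshot_id FROM {table}.snapshots "
        f"WHERE summary['wap.id'] = '{wap_id}'",
    ]
```

An audit process would then read the staged snapshot with the snapshot-id read option (the time-travel mechanism Ryan mentions) and, on success, publish it with the cherrypick_snapshot procedure. On failure, which answers Edgar's question as far as I can tell, the staged snapshot is simply never published and can be cleaned up later through normal snapshot expiration.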
