Thanks Ryan, that worked out. Since it is a rollback, I wonder how a user can stage multiple WAP snapshots and then commit them in any order, depending on how the audit process works out. I also wonder whether that expectation goes against the underlying principles of Iceberg.

Thanks,
Ashish
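For reference, a minimal Java sketch of how staged snapshots could be enumerated by the wap.id summary entry discussed later in this thread. It assumes that staged snapshots remain listed by Table.snapshots() and that "wap.id" is the summary key written for them; it does not try to distinguish staged snapshots from ones that have already been published.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;

    public class StagedWapSnapshots {
      // Collect the IDs of snapshots whose summary carries a "wap.id" entry,
      // i.e. snapshots staged by a WAP write.
      public static List<Long> stagedSnapshotIds(Table table) {
        List<Long> staged = new ArrayList<>();
        for (Snapshot snapshot : table.snapshots()) {
          if (snapshot.summary() != null && snapshot.summary().containsKey("wap.id")) {
            staged.add(snapshot.snapshotId());
          }
        }
        return staged;
      }
    }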
On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <[email protected]> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot as the current table state. Like this:

    Table table = hiveCatalog.load(name);
    table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <[email protected]> wrote:

Hi Ryan,

Can you please point me to the doc that describes how to publish a WAP snapshot? I am able to filter the snapshot based on the wap.id in the snapshot's summary, but I am unsure about the official recommendation for committing that snapshot. I can think of cherry-picking the appended/deleted files, but I don't know whether I would be missing something important with that approach.

Thanks,
-Ashish

---------- Forwarded message ---------
From: Ryan Blue <[email protected]>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <[email protected]>
Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>

Hi everyone, I've added PR #342 <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg repository with our WAP changes. Please have a look if you are interested in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:

I think this use case is pretty helpful in most data environments; we do the same sort of stage-check-publish pattern to run quality checks. One question: if the audit part fails, is there a way to expire the snapshot, or what would the workflow be that follows?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:

This would be super helpful. We have a similar workflow where we do some validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:

This definitely sounds interesting. Quick question on whether this has any impact on the current upserts spec? Or are we looking to associate this support with append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:

Audits run on the snapshot by setting the snapshot-id read option to read the WAP snapshot, even though it is not (yet) the current table state. This is documented in the time travel <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.

We added a stageOnly method to SnapshotProducer that adds the snapshot to table metadata, but does not make it the current table state. That method is called by the Spark writer when there is a WAP ID, and that ID is embedded in the staged snapshot's metadata so processes can find it.

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:

I would also support adding this to Iceberg itself. I think we have a use case where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton
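As a concrete illustration of the audit step Ryan describes above (reading a staged snapshot through the snapshot-id read option before it becomes the current table state), a sketch along these lines should work. The table identifier passed to load() and the placeholder count check are assumptions; a real audit would run the project's own validation logic.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class StagedSnapshotAudit {
      // Read the staged snapshot via the snapshot-id read option (time travel)
      // and run a simple validation over it.
      public static boolean audit(SparkSession spark, String tableIdentifier, long stagedSnapshotId) {
        Dataset<Row> staged = spark.read()
            .format("iceberg")
            .option("snapshot-id", Long.toString(stagedSnapshotId))
            .load(tableIdentifier);
        return staged.count() > 0;  // placeholder check: the staged write is not empty
      }
    }

Once an audit passes, the thread suggests two ways to publish: cherry-pick the staged changes (as Ashish considers) or make the staged snapshot the current one via the rollback operation Ryan shows above.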
On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:

I think this could be useful. When we ingest data from Kafka, we run a predefined set of checks on the data. We could potentially use something like this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:

Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing the written data to a final table. We call this WAP, for write, audit, publish.

We've added support for it in our Iceberg branch. A WAP write creates a new table snapshot, but doesn't make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table's current snapshot when the audits succeed. I wasn't sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table's current state. That happens when there is a Spark property, spark.wap.id, that indicates the job is a WAP job. Then any table that has WAP enabled by the table property write.wap.enabled=true will stage the new snapshot instead of fully committing, with the WAP ID in the snapshot's metadata.

Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded, but not actually change a table, which is why we didn't submit it before now.

Thanks,

rb
--
Ryan Blue
Software Engineer
Netflix
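Tying the proposal together, a hedged sketch of staging a WAP write using the two settings named above (write.wap.enabled on the table and spark.wap.id on the Spark session). The catalog wiring, the WAP ID value, and setting spark.wap.id through spark.conf() at runtime are illustrative assumptions rather than a definitive recipe.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hive.HiveCatalog;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class WapStagedWrite {
      public static void stageWrite(SparkSession spark, HiveCatalog hiveCatalog,
                                    Dataset<Row> df, String tableName) {
        // Opt the table into WAP behaviour with the table property named in this thread.
        Table table = hiveCatalog.loadTable(TableIdentifier.parse(tableName));
        table.updateProperties()
            .set("write.wap.enabled", "true")
            .commit();

        // Tag the job with a WAP ID so the write below is staged (kept in table
        // metadata with the ID in the snapshot summary) rather than committed as
        // the current table state.
        spark.conf().set("spark.wap.id", "etl-2019-11-08");  // hypothetical ID

        df.write()
            .format("iceberg")
            .mode(SaveMode.Append)
            .save(tableName);
      }
    }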
