I just had a direct request for this over the weekend, too. I opened #629 Add cherry-pick operation <https://github.com/apache/incubator-iceberg/issues/629> to track this.
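For anyone skimming the thread below, here is a minimal sketch of why publishing one of two pending snapshots by rollback drops the other write, and what a cherry-pick would do instead. It is a toy model under stated assumptions: the `Snapshot` class and the `append` and `cherryPick` helpers are hypothetical names made up for illustration, only appends are modeled, and none of this is Iceberg's API or the design that will end up in #629.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model only: these classes are hypothetical and are not Iceberg's API. */
public class CherryPickSketch {

  /** A snapshot holds the complete file listing plus the files it added. */
  static final class Snapshot {
    final long id;
    final Set<String> allFiles;
    final Set<String> addedFiles;

    Snapshot(long id, Set<String> allFiles, Set<String> addedFiles) {
      this.id = id;
      this.allFiles = Collections.unmodifiableSet(new LinkedHashSet<>(allFiles));
      this.addedFiles = Collections.unmodifiableSet(new LinkedHashSet<>(addedFiles));
    }
  }

  /** Small helper to build an insertion-ordered set. */
  static Set<String> setOf(String... values) {
    return new LinkedHashSet<>(Arrays.asList(values));
  }

  /** Creates an append snapshot on top of the given base state. */
  static Snapshot append(long id, Snapshot base, String... newFiles) {
    Set<String> added = setOf(newFiles);
    Set<String> all = new LinkedHashSet<>(base.allFiles);
    all.addAll(added);
    return new Snapshot(id, all, added);
  }

  /** Re-applies only the files added by `picked` on top of `current`. */
  static Snapshot cherryPick(long id, Snapshot current, Snapshot picked) {
    return append(id, current, picked.addedFiles.toArray(new String[0]));
  }

  public static void main(String[] args) {
    Snapshot base = new Snapshot(1, setOf("f0"), setOf("f0"));

    // Two pending writes, both created from the same table state.
    Snapshot stagedA = append(2, base, "a1");
    Snapshot stagedB = append(3, base, "b1");

    // Publishing stagedB by moving the current pointer ignores a1 entirely.
    Snapshot current = stagedB;
    System.out.println(current.allFiles); // prints [f0, b1]

    // Cherry-picking stagedA applies its changes on top of the current state.
    Snapshot published = cherryPick(4, current, stagedA);
    System.out.println(published.allFiles); // prints [f0, b1, a1]
  }
}
```

The sketch only covers appends. Snapshots that delete or overwrite files would need conflict checks that are not modeled here.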
On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:

> We would be interested in this functionality as well. We have a use case
> with multiple concurrent writers where we wanted to use WAP but couldn’t.
>
> On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Right now, there isn't a good way to manage multiple pending writes.
> Snapshots from each write are created based on the current table state, so
> simply moving to one of two pending commits would mean you ignore the
> changes in the other pending commit. We've considered adding a
> "cherry-pick" operation that can take the changes from one snapshot and
> apply them on top of another to solve that problem. If you'd like to
> implement that, I'd be happy to review it!
>
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
>
>> Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user
>> can stage multiple WAP snapshots and commit them in any order, based on
>> how the audit process works out?
>> I wonder whether this expectation goes against the underlying principles
>> of Iceberg.
>>
>> Thanks,
>> Ashish
>>
>> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Ashish, you can use the rollback table operation to set a particular
>>> snapshot as the current table state. Like this:
>>>
>>> Table table = hiveCatalog.load(name);
>>> table.rollback().toSnapshotId(id).commit();
>>>
>>>
>>> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Can you please point me to the doc where I can find how to publish a
>>>> WAP snapshot? I am able to filter the snapshot based on wap.id in the
>>>> Snapshot summary, but I am clueless about the official recommendation on
>>>> committing that snapshot. I can think of cherry-picking Appended/Deleted
>>>> files, but I don't know whether I would be missing something important
>>>> with this.
>>>>
>>>> Thanks,
>>>> -Ashish
>>>>
>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Ryan Blue <rb...@netflix.com.invalid>
>>>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>>>> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
>>>>> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>
>>>>>
>>>>>
>>>>> Hi everyone, I've added PR #342
>>>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>>>> repository with our WAP changes. Please have a look if you were interested
>>>>> in this.
>>>>>
>>>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
>>>>>
>>>>>> I think this use case is pretty helpful in most data environments, we
>>>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>>>> One question is, if say the audit part fails, is there a way to
>>>>>> expire the snapshot or what would be the workflow that follows?
>>>>>>
>>>>>> Best,
>>>>>> Edgar
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
>>>>>>
>>>>>>> This would be super helpful. We have a similar workflow where we do
>>>>>>> some validation before letting the downstream consume the changes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Mouli
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>>>>>>
>>>>>>>> This definitely sounds interesting. Quick question on whether this
>>>>>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>>>>>> looking to associate this support for append-only commits?
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>>>> to read the WAP snapshot, even though it has not (yet) been the current
>>>>>>>>> table state.
>>>>>>>>> This is documented in the time travel
>>>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>>>> Iceberg site.
>>>>>>>>>
>>>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>>>> snapshot to table metadata, but does not make it the current table
>>>>>>>>> state. That is called by the Spark writer when there is a WAP ID, and
>>>>>>>>> that ID is embedded in the staged snapshot’s metadata so processes can
>>>>>>>>> find it.
>>>>>>>>>
>>>>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would also support adding this to Iceberg itself. I think we
>>>>>>>>>> have a use case where we can leverage this.
>>>>>>>>>>
>>>>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I think this could be useful. When we ingest data from Kafka, we
>>>>>>>>>> do a predefined set of checks on the data. We can potentially utilize
>>>>>>>>>> something like this to check for sanity before publishing.
>>>>>>>>>>
>>>>>>>>>> How is the auditing process supposed to find the new snapshot,
>>>>>>>>>> since it is not accessible from the table? Is it by convention?
>>>>>>>>>>
>>>>>>>>>> -R
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we
>>>>>>>>>>> write data, then audit the result before publishing the data that was
>>>>>>>>>>> written to a final table. We call this WAP for write, audit, publish.
>>>>>>>>>>>
>>>>>>>>>>> We’ve added support in our Iceberg branch.
>>>>>>>>>>> A WAP write creates a new table snapshot, but doesn’t make that
>>>>>>>>>>> snapshot the current version of the table. Instead, a separate process
>>>>>>>>>>> audits the new snapshot and updates the table’s current snapshot when
>>>>>>>>>>> the audits succeed. I wasn’t sure that this would be useful anywhere
>>>>>>>>>>> else until we talked to another company this week that is interested in
>>>>>>>>>>> the same thing. So I wanted to check whether this is a good feature to
>>>>>>>>>>> include in Iceberg itself.
>>>>>>>>>>>
>>>>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>>>>> expected, but Iceberg detects that it should not update the table’s
>>>>>>>>>>> current state. That happens when there is a Spark property,
>>>>>>>>>>> spark.wap.id, that indicates the job is a WAP job. Then any table that
>>>>>>>>>>> has WAP enabled by the table property write.wap.enabled=true will stage
>>>>>>>>>>> the new snapshot instead of fully committing, with the WAP ID in the
>>>>>>>>>>> snapshot’s metadata.
>>>>>>>>>>>
>>>>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>>>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Filip Bocse
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Edgar Rodriguez
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
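
As a recap of the write, audit, publish flow discussed in this thread, the following is a rough sketch of the three steps: a staged commit that does not move the current snapshot, an audit process that finds the staged snapshot by the wap.id entry in its summary, and a publish step that makes it current. Everything here is an assumption for illustration. `Table`, `Snapshot`, `commit`, `findByWapId`, `publish`, and the row-count audit rule are hypothetical stand-ins, not Iceberg classes or the code from PR #342.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model only: hypothetical classes, not Iceberg's API. */
public class WapFlowSketch {

  static final class Snapshot {
    final long id;
    final long rowCount;
    final Map<String, String> summary;

    Snapshot(long id, long rowCount, Map<String, String> summary) {
      this.id = id;
      this.rowCount = rowCount;
      this.summary = Collections.unmodifiableMap(new HashMap<>(summary));
    }
  }

  static final class Table {
    private final List<Snapshot> snapshots = new ArrayList<>();
    private Long currentSnapshotId = null;

    /** Records the snapshot; only a non-staged commit moves the current pointer. */
    void commit(Snapshot snapshot, boolean stageOnly) {
      snapshots.add(snapshot);
      if (!stageOnly) {
        currentSnapshotId = snapshot.id;
      }
    }

    /** Finds the first snapshot whose summary carries the given WAP ID. */
    Optional<Snapshot> findByWapId(String wapId) {
      return snapshots.stream()
          .filter(s -> wapId.equals(s.summary.get("wap.id")))
          .findFirst();
    }

    /** Makes an existing snapshot the current table state. */
    void publish(long snapshotId) {
      currentSnapshotId = snapshotId;
    }

    Long currentSnapshotId() {
      return currentSnapshotId;
    }
  }

  /** Write step: stage a snapshot tagged with the WAP ID. */
  static void stageWrite(Table table, long snapshotId, long rowCount, String wapId) {
    Map<String, String> summary = new HashMap<>();
    summary.put("wap.id", wapId);
    table.commit(new Snapshot(snapshotId, rowCount, summary), true);
  }

  /** Audit and publish steps; the audit rule (non-empty write) is made up. */
  static boolean auditAndPublish(Table table, String wapId) {
    Optional<Snapshot> staged = table.findByWapId(wapId);
    if (!staged.isPresent() || staged.get().rowCount == 0) {
      return false;
    }
    table.publish(staged.get().id);
    return true;
  }

  public static void main(String[] args) {
    Table table = new Table();
    table.commit(new Snapshot(1, 10, new HashMap<>()), false);

    stageWrite(table, 2, 5, "job-42");
    System.out.println(table.currentSnapshotId()); // prints 1

    System.out.println(auditAndPublish(table, "job-42")); // prints true
    System.out.println(table.currentSnapshotId()); // prints 2
  }
}
```

The publish step here simply moves the current pointer, like the rollback operation mentioned earlier in the thread, so it is only safe with a single pending write. With several staged snapshots it would drop the changes of the others, which is the gap the cherry-pick issue is meant to close.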