We would be interested in this functionality as well. We have a use case with multiple concurrent writers where we wanted to use WAP but couldn’t.
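
To make the concurrent-writer problem concrete, here is a toy model in plain Java of what Ryan describes in the quoted thread. It is only a sketch and not the Iceberg API: `WapSketch`, `Table`, `Snapshot`, `stage`, `findByWapId`, `rollbackTo` and `cherryPick` are hypothetical names made up for illustration, a snapshot is reduced to a set of file names, and only appends are modeled (no deletes or overwrites).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of staged (WAP) snapshots. Hypothetical names, NOT the Iceberg API. */
class WapSketch {

    /** A snapshot holds the full table contents plus the files its write added. */
    static final class Snapshot {
        final long id;
        final String wapId;      // null for snapshots that were not staged
        final Set<String> files; // complete table contents at this snapshot
        final Set<String> added; // files this write added on top of its parent

        Snapshot(long id, String wapId, Set<String> files, Set<String> added) {
            this.id = id;
            this.wapId = wapId;
            this.files = files;
            this.added = added;
        }
    }

    static final class Table {
        final Map<Long, Snapshot> snapshots = new LinkedHashMap<>();
        long currentId = 0;
        long nextId = 1;

        Table() {
            // Start from an empty table at snapshot 0.
            snapshots.put(0L, new Snapshot(0L, null, new LinkedHashSet<>(), new LinkedHashSet<>()));
        }

        Snapshot current() {
            return snapshots.get(currentId);
        }

        /** Stage a write: build a snapshot on the current state without publishing it. */
        Snapshot stage(String wapId, String... newFiles) {
            Set<String> added = new LinkedHashSet<>(Arrays.asList(newFiles));
            Set<String> files = new LinkedHashSet<>(current().files);
            files.addAll(added);
            Snapshot staged = new Snapshot(nextId++, wapId, files, added);
            snapshots.put(staged.id, staged);
            return staged; // currentId is deliberately left unchanged
        }

        /** Look up a staged snapshot by the WAP ID recorded with it. */
        Snapshot findByWapId(String wapId) {
            for (Snapshot s : snapshots.values()) {
                if (wapId.equals(s.wapId)) {
                    return s;
                }
            }
            throw new IllegalArgumentException("No staged snapshot for WAP ID " + wapId);
        }

        /** Rollback-style publish: point the table at an existing snapshot. */
        void rollbackTo(long snapshotId) {
            if (!snapshots.containsKey(snapshotId)) {
                throw new IllegalArgumentException("Unknown snapshot " + snapshotId);
            }
            currentId = snapshotId;
        }

        /** Cherry-pick-style publish: re-apply only the staged additions on top of current. */
        void cherryPick(long snapshotId) {
            Snapshot staged = snapshots.get(snapshotId);
            if (staged == null) {
                throw new IllegalArgumentException("Unknown snapshot " + snapshotId);
            }
            Set<String> files = new LinkedHashSet<>(current().files);
            files.addAll(staged.added);
            Snapshot published = new Snapshot(nextId++, null, files, staged.added);
            snapshots.put(published.id, published);
            currentId = published.id;
        }
    }

    public static void main(String[] args) {
        // Two writers stage against the same (empty) table state.
        Table viaRollback = new Table();
        viaRollback.stage("job-a", "a.parquet");
        viaRollback.stage("job-b", "b.parquet");
        viaRollback.rollbackTo(viaRollback.findByWapId("job-a").id);
        viaRollback.rollbackTo(viaRollback.findByWapId("job-b").id);
        System.out.println("rollback:    " + viaRollback.current().files);

        Table viaCherryPick = new Table();
        viaCherryPick.stage("job-a", "a.parquet");
        viaCherryPick.stage("job-b", "b.parquet");
        viaCherryPick.cherryPick(viaCherryPick.findByWapId("job-a").id);
        viaCherryPick.cherryPick(viaCherryPick.findByWapId("job-b").id);
        System.out.println("cherry-pick: " + viaCherryPick.current().files);
    }
}
```

Running `main` prints `rollback:    [b.parquet]` and then `cherry-pick: [a.parquet, b.parquet]`. In this model, publishing both staged writes by rollback keeps only the second writer's file, because each staged snapshot was built from the table state at staging time. Re-applying each snapshot's additions on top of the current state keeps both, which is the behaviour we would need for our use case.
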
> On 9 Nov 2019, at 01:32, Ryan Blue <[email protected]> wrote:
>
> Right now, there isn't a good way to manage multiple pending writes.
> Snapshots from each write are created based on the current table state, so
> simply moving to one of two pending commits would mean you ignore the changes
> in the other pending commit. We've considered adding a "cherry-pick"
> operation that can take the changes from one snapshot and apply them on top
> of another to solve that problem. If you'd like to implement that, I'd be
> happy to review it!
>
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <[email protected]> wrote:
> Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user can
> stage multiple WAP snapshots and commit them in any order, based on how the
> audit process works out.
> I wonder whether this expectation goes against the underlying principles of
> Iceberg.
>
> Thanks,
> Ashish
>
> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <[email protected]> wrote:
> Ashish, you can use the rollback table operation to set a particular snapshot
> as the current table state. Like this:
>
> Table table = hiveCatalog.load(name);
> table.rollback().toSnapshotId(id).commit();
>
> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <[email protected]> wrote:
> Hi Ryan,
>
> Can you please point me to the doc that explains how to publish a WAP
> snapshot? I am able to filter the snapshot based on wap.id in the snapshot
> summary, but I am unsure of the official recommendation for committing that
> snapshot. I can think of cherry-picking Appended/Deleted files, but I don't
> know whether I would be missing something important with this.
>
> Thanks,
> -Ashish
>
> ---------- Forwarded message ---------
> From: Ryan Blue <[email protected]>
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <[email protected]>
> Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>
>
>
> Hi everyone, I've added PR #342
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
> repository with our WAP changes. Please have a look if you were interested in
> this.
>
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:
> I think this use case is pretty helpful in most data environments, we do the
> same sort of stage-check-publish pattern to run quality checks.
> One question is, if say the audit part fails, is there a way to expire the
> snapshot or what would be the workflow that follows?
>
> Best,
> Edgar
>
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:
> This would be super helpful. We have a similar workflow where we do some
> validation before letting the downstream consume the changes.
>
> Best,
> Mouli
>
> On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:
> This definitely sounds interesting. Quick question on whether this presents
> impact on the current Upserts spec? Or is it maybe that we are looking to
> associate this support for append-only commits?
>
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:
> Audits run on the snapshot by setting the snapshot-id read option to read the
> WAP snapshot, even though it has not (yet) been the current table state. This
> is documented in the time travel
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
>
> We added a stageOnly method to SnapshotProducer that adds the snapshot to
> table metadata, but does not make it the current table state. That is called
> by the Spark writer when there is a WAP ID, and that ID is embedded in the
> staged snapshot’s metadata so processes can find it.
>
> I'll add a PR with this code, since there is interest.
>
> rb
>
>
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
> I would also support adding this to Iceberg itself. I think we have a use
> case where we can leverage this.
>
> @Ryan, could you also provide more info on the audit process?
>
> Thanks,
> Anton
>
>> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>>
>> I think this could be useful. When we ingest data from Kafka, we do a
>> predefined set of checks on the data. We can potentially utilize something
>> like this to check for sanity before publishing.
>>
>> How is the auditing process supposed to find the new snapshot, since it is
>> not accessible from the table? Is it by convention?
>>
>> -R
>>
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:
>> Hi everyone,
>>
>> At Netflix, we have a pattern for building ETL jobs where we write data,
>> then audit the result before publishing the data that was written to a final
>> table. We call this WAP for write, audit, publish.
>>
>> We’ve added support in our Iceberg branch. A WAP write creates a new table
>> snapshot, but doesn’t make that snapshot the current version of the table.
>> Instead, a separate process audits the new snapshot and updates the table’s
>> current snapshot when the audits succeed. I wasn’t sure that this would be
>> useful anywhere else until we talked to another company this week that is
>> interested in the same thing. So I wanted to check whether this is a good
>> feature to include in Iceberg itself.
>>
>> This works by staging a snapshot.
>> Basically, Spark writes data as expected,
>> but Iceberg detects that it should not update the table’s current state.
>> That happens when there is a Spark property, spark.wap.id, that indicates
>> the job is a WAP job. Then any table that has WAP enabled by the table
>> property write.wap.enabled=true will stage the new snapshot instead of fully
>> committing, with the WAP ID in the snapshot’s metadata.
>>
>> Is this something we should open a PR to add to Iceberg? It seems a little
>> strange to make it appear that a commit has succeeded, but not actually
>> change a table, which is why we didn’t submit it before now.
>>
>> Thanks,
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
> --
> Filip Bocse
>
>
> --
> Edgar Rodriguez
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
