Re: [DISCUSS] Write-audit-publish support

Anton Okolnychyi Mon, 11 Nov 2019 01:44:44 -0800

We would be interested in this functionality as well. We have a use case with 
multiple concurrent writers where we wanted to use WAP but couldn’t.


> On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Right now, there isn't a good way to manage multiple pending writes. 
> Snapshots from each write are created based on the current table state, so 
> simply moving to one of two pending commits would mean you ignore the changes 
> in the other pending commit. We've considered adding a "cherry-pick" 
> operation that can take the changes from one snapshot and apply them on top 
> of another to solve that problem. If you'd like to implement that, I'd be 
> happy to review it!
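The limitation described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Iceberg's API: ToySnapshot, cherryPickOnto, and WapCherryPickDemo are hypothetical names, and a snapshot is reduced to the set of file names it can see.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model (hypothetical, not an Iceberg class): a snapshot is the full set
// of data files visible in it, plus the files its own write added.
final class ToySnapshot {
    final Set<String> files;
    final Set<String> added;

    ToySnapshot(Set<String> parentFiles, Set<String> added) {
        this.added = new LinkedHashSet<>(added);
        this.files = new LinkedHashSet<>(parentFiles);
        this.files.addAll(added);
    }

    // Assumed meaning of "cherry-pick": replay this snapshot's added files
    // on top of another table state instead of its original parent.
    ToySnapshot cherryPickOnto(ToySnapshot base) {
        return new ToySnapshot(base.files, this.added);
    }
}

final class WapCherryPickDemo {
    static Set<String> setOf(String... names) {
        return new LinkedHashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        ToySnapshot current = new ToySnapshot(setOf(), setOf("base.parquet"));

        // Two writers stage snapshots against the same current table state.
        ToySnapshot pendingA = new ToySnapshot(current.files, setOf("a.parquet"));
        ToySnapshot pendingB = new ToySnapshot(current.files, setOf("b.parquet"));

        // Publish A by moving the table pointer to it.
        current = pendingA;

        // B was built before A was published, so it does not see A's file.
        System.out.println(pendingB.files.contains("a.parquet")); // false

        // Replaying B's changes on top of the new current state keeps both.
        ToySnapshot merged = pendingB.cherryPickOnto(current);
        System.out.println(merged.files); // [base.parquet, a.parquet, b.parquet]
    }
}
```

In this model, moving the table pointer to pendingB after publishing pendingA would silently drop a.parquet, which is the problem a cherry-pick operation would address.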
> 
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
> Thanks Ryan, that worked. Since it's a rollback, I wonder how a user can 
> stage multiple WAP snapshots and commit them in any order, depending on how 
> the audit process works out.
> I wonder whether this expectation goes against the underlying principles of 
> Iceberg. 
> 
> Thanks,
> Ashish
> 
> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Ashish, you can use the rollback table operation to set a particular snapshot 
> as the current table state. Like this:
> 
> Table table = hiveCatalog.loadTable(name);
> table.rollback().toSnapshotId(id).commit();
> 
> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
> Hi Ryan, 
> 
> Can you please point me to the doc that explains how to publish a WAP 
> snapshot? I am able to find the snapshot by filtering on wap.id in the 
> snapshot summary, but I don't know the officially recommended way to commit 
> that snapshot. I can think of cherry-picking the appended/deleted files, 
> but I'm not sure whether that approach would miss something important.
> 
> Thanks,
> -Ashish
>  
> ---------- Forwarded message ---------
> From: Ryan Blue <rb...@netflix.com.invalid>
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi 
> <aokolnyc...@apple.com>
> 
> 
> Hi everyone, I've added PR #342 
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg 
> repository with our WAP changes. Please have a look if you are interested in 
> this.
> 
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> 
> wrote:
> I think this use case is helpful in most data environments; we use the same 
> sort of stage-check-publish pattern to run quality checks. 
> One question: if the audit fails, is there a way to expire the snapshot, and 
> what would the follow-up workflow be?
> 
> Best,
> Edgar
> 
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> 
> wrote:
> This would be super helpful. We have a similar workflow where we do some 
> validation before letting the downstream consume the changes.
> 
> Best,
> Mouli
> 
> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
> This definitely sounds interesting. Quick question: does this have any 
> impact on the current upserts spec? Or are we looking to support this only 
> for append-only commits?
> 
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Audits run on the snapshot by setting the snapshot-id read option to read the 
> WAP snapshot, even though it is not (yet) the current table state. This 
> is documented in the time travel 
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
> 
> We added a stageOnly method to SnapshotProducer that adds the snapshot to 
> table metadata, but does not make it the current table state. That is called 
> by the Spark writer when there is a WAP ID, and that ID is embedded in the 
> staged snapshot’s metadata so processes can find it.
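The staging behavior described above can be sketched with a toy in-memory model. This is not the SnapshotProducer code or any Iceberg class: ToyWapTable, Snap, stage, findStaged, and publish are hypothetical names, and the only detail taken from this thread is that the WAP ID is recorded under wap.id in the snapshot summary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Toy in-memory model (hypothetical names, not Iceberg classes).
final class ToyWapTable {
    static final class Snap {
        final long id;
        final Map<String, String> summary;

        Snap(long id, Map<String, String> summary) {
            this.id = id;
            this.summary = Collections.unmodifiableMap(new HashMap<>(summary));
        }
    }

    private final List<Snap> snapshots = new ArrayList<>();
    private long nextId = 1;
    private Long currentId = null; // null until a snapshot becomes current

    // Regular commit: record the snapshot and make it the current table state.
    long commit() {
        Snap snap = add(new HashMap<>());
        currentId = snap.id;
        return snap.id;
    }

    // Staged commit: record the snapshot with its WAP ID, leave current untouched.
    long stage(String wapId) {
        Map<String, String> summary = new HashMap<>();
        summary.put("wap.id", wapId);
        return add(summary).id;
    }

    // How an audit process could locate a staged snapshot by its WAP ID.
    Optional<Snap> findStaged(String wapId) {
        return snapshots.stream()
            .filter(s -> wapId.equals(s.summary.get("wap.id")))
            .findFirst();
    }

    // Publish by pointing the current state at an already recorded snapshot.
    void publish(long snapshotId) {
        boolean known = snapshots.stream().anyMatch(s -> s.id == snapshotId);
        if (!known) {
            throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
        }
        currentId = snapshotId;
    }

    Long currentSnapshotId() {
        return currentId;
    }

    private Snap add(Map<String, String> summary) {
        Snap snap = new Snap(nextId++, summary);
        snapshots.add(snap);
        return snap;
    }
}

final class ToyWapTableDemo {
    public static void main(String[] args) {
        ToyWapTable table = new ToyWapTable();
        long committed = table.commit();
        long staged = table.stage("audit-run-1"); // example WAP ID

        // The staged snapshot exists in metadata but is not the current state.
        System.out.println(table.currentSnapshotId() == committed); // true

        // After the audit passes, look up the snapshot by WAP ID and publish it.
        table.findStaged("audit-run-1").ifPresent(s -> table.publish(s.id));
        System.out.println(table.currentSnapshotId() == staged); // true
    }
}
```

The point of the model is that staging only adds to the snapshot list, so readers of the current state are unaffected until a separate publish step moves the pointer.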
> 
> I'll add a PR with this code, since there is interest.
> 
> rb
> 
> 
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> 
> wrote:
> I would also support adding this to Iceberg itself. I think we have a use 
> case where we can leverage this.
> 
> @Ryan, could you also provide more info on the audit process?
> 
> Thanks,
> Anton
> 
>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>> 
>> I think this could be useful. When we ingest data from Kafka, we do a 
>> predefined set of checks on the data. We can potentially utilize something 
>> like this to check for sanity before publishing.  
>> 
>> How is the auditing process supposed to find the new snapshot, since it is 
>> not accessible from the table? Is it by convention?
>> 
>> -R 
>> 
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>> Hi everyone,
>> 
>> At Netflix, we have a pattern for building ETL jobs where we write data, 
>> then audit the result before publishing the data that was written to a final 
>> table. We call this WAP for write, audit, publish.
>> 
>> We’ve added support in our Iceberg branch. A WAP write creates a new table 
>> snapshot, but doesn’t make that snapshot the current version of the table. 
>> Instead, a separate process audits the new snapshot and updates the table’s 
>> current snapshot when the audits succeed. I wasn’t sure that this would be 
>> useful anywhere else until we talked to another company this week that is 
>> interested in the same thing. So I wanted to check whether this is a good 
>> feature to include in Iceberg itself.
>> 
>> This works by staging a snapshot. Basically, Spark writes data as expected, 
>> but Iceberg detects that it should not update the table’s current state. 
>> That happens when there is a Spark property, spark.wap.id, that indicates 
>> the job is a WAP job. Then any table that has WAP enabled by the table 
>> property write.wap.enabled=true will stage 
>> the new snapshot instead of fully committing, with the WAP ID in the 
>> snapshot’s metadata.
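The condition described in the paragraph above can be written as a small check. This is a sketch only: the two property names come from the message, while WapStagingCheck, shouldStage, and the handling of missing or empty values are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Iceberg or Spark.
final class WapStagingCheck {
    // Property names as given in the message; everything else is assumed.
    static final String WAP_ID = "spark.wap.id";
    static final String WAP_ENABLED = "write.wap.enabled";

    // Stage only when the job carries a WAP ID and the table opts in.
    static boolean shouldStage(Map<String, String> sparkConf, Map<String, String> tableProps) {
        String wapId = sparkConf.get(WAP_ID);
        boolean enabled = Boolean.parseBoolean(tableProps.getOrDefault(WAP_ENABLED, "false"));
        return wapId != null && !wapId.isEmpty() && enabled;
    }

    public static void main(String[] args) {
        Map<String, String> sparkConf = new HashMap<>();
        sparkConf.put(WAP_ID, "audit-run-1"); // example WAP ID
        Map<String, String> tableProps = new HashMap<>();
        tableProps.put(WAP_ENABLED, "true");

        System.out.println(shouldStage(sparkConf, tableProps));       // true
        System.out.println(shouldStage(new HashMap<>(), tableProps)); // false
    }
}
```

Requiring both settings means a WAP job writing to a table without write.wap.enabled=true commits normally in this sketch.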
>> 
>> Is this something we should open a PR to add to Iceberg? It seems a little 
>> strange to make it appear that a commit has succeeded, but not actually 
>> change a table, which is why we didn’t submit it before now.
>> 
>> Thanks,
>> 
>> rb
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Filip Bocse
> 
> 
> -- 
> Edgar Rodriguez
