Thanks Ryan, that worked out. Since it is a rollback, I wonder how a user can stage multiple WAP snapshots and then commit them in any order, depending on how the audit process works out. I also wonder whether that expectation goes against the underlying principles of Iceberg.

Thanks,
Ashish
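For reference, a minimal Java sketch of how staged snapshots could be enumerated by the wap.id summary entry discussed later in this thread. It assumes that staged snapshots remain listed by Table.snapshots() and that "wap.id" is the summary key written for them; it does not try to distinguish staged snapshots from ones that have already been published.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;

    public class StagedWapSnapshots {
      // Collect the IDs of snapshots whose summary carries a "wap.id" entry,
      // i.e. snapshots staged by a WAP write.
      public static List<Long> stagedSnapshotIds(Table table) {
        List<Long> staged = new ArrayList<>();
        for (Snapshot snapshot : table.snapshots()) {
          if (snapshot.summary() != null && snapshot.summary().containsKey("wap.id")) {
            staged.add(snapshot.snapshotId());
          }
        }
        return staged;
      }
    }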
On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <[email protected]> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot as the current table state. Like this:

    Table table = hiveCatalog.load(name);
    table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <[email protected]> wrote:

Hi Ryan,

Can you please point me to the doc that describes how to publish a WAP snapshot? I am able to filter the snapshot based on the wap.id in the snapshot's summary, but I am unsure about the official recommendation for committing that snapshot. I can think of cherry-picking the appended/deleted files, but I don't know whether I would be missing something important with that approach.

Thanks,
-Ashish

---------- Forwarded message ---------
From: Ryan Blue <[email protected]>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <[email protected]>
Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>

Hi everyone, I've added PR #342 <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg repository with our WAP changes. Please have a look if you are interested in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:

I think this use case is pretty helpful in most data environments; we do the same sort of stage-check-publish pattern to run quality checks. One question: if the audit part fails, is there a way to expire the snapshot, or what would the workflow be that follows?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:

This would be super helpful. We have a similar workflow where we do some validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:

This definitely sounds interesting. Quick question on whether this has any impact on the current upserts spec? Or are we looking to associate this support with append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:

Audits run on the snapshot by setting the snapshot-id read option to read the WAP snapshot, even though it is not (yet) the current table state. This is documented in the time travel <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.

We added a stageOnly method to SnapshotProducer that adds the snapshot to table metadata, but does not make it the current table state. That method is called by the Spark writer when there is a WAP ID, and that ID is embedded in the staged snapshot's metadata so processes can find it.

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:

I would also support adding this to Iceberg itself. I think we have a use case where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton
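As a concrete illustration of the audit step Ryan describes above (reading a staged snapshot through the snapshot-id read option before it becomes the current table state), a sketch along these lines should work. The table identifier passed to load() and the placeholder count check are assumptions; a real audit would run the project's own validation logic.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class StagedSnapshotAudit {
      // Read the staged snapshot via the snapshot-id read option (time travel)
      // and run a simple validation over it.
      public static boolean audit(SparkSession spark, String tableIdentifier, long stagedSnapshotId) {
        Dataset<Row> staged = spark.read()
            .format("iceberg")
            .option("snapshot-id", Long.toString(stagedSnapshotId))
            .load(tableIdentifier);
        return staged.count() > 0;  // placeholder check: the staged write is not empty
      }
    }

Once an audit passes, the thread suggests two ways to publish: cherry-pick the staged changes (as Ashish considers) or make the staged snapshot the current one via the rollback operation Ryan shows above.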
On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:

I think this could be useful. When we ingest data from Kafka, we run a predefined set of checks on the data. We could potentially use something like this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:

Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing the written data to a final table. We call this WAP, for write, audit, publish.

We've added support for it in our Iceberg branch. A WAP write creates a new table snapshot, but doesn't make that snapshot the current version of the table. Instead, a separate process audits the new snapshot and updates the table's current snapshot when the audits succeed. I wasn't sure that this would be useful anywhere else until we talked to another company this week that is interested in the same thing. So I wanted to check whether this is a good feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but Iceberg detects that it should not update the table's current state. That happens when there is a Spark property, spark.wap.id, that indicates the job is a WAP job. Then any table that has WAP enabled by the table property write.wap.enabled=true will stage the new snapshot instead of fully committing, with the WAP ID in the snapshot's metadata.

Is this something we should open a PR to add to Iceberg? It seems a little strange to make it appear that a commit has succeeded, but not actually change a table, which is why we didn't submit it before now.

Thanks,

rb
--
Ryan Blue
Software Engineer
Netflix
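Tying the proposal together, a hedged sketch of staging a WAP write using the two settings named above (write.wap.enabled on the table and spark.wap.id on the Spark session). The catalog wiring, the WAP ID value, and setting spark.wap.id through spark.conf() at runtime are illustrative assumptions rather than a definitive recipe.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hive.HiveCatalog;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class WapStagedWrite {
      public static void stageWrite(SparkSession spark, HiveCatalog hiveCatalog,
                                    Dataset<Row> df, String tableName) {
        // Opt the table into WAP behaviour with the table property named in this thread.
        Table table = hiveCatalog.loadTable(TableIdentifier.parse(tableName));
        table.updateProperties()
            .set("write.wap.enabled", "true")
            .commit();

        // Tag the job with a WAP ID so the write below is staged (kept in table
        // metadata with the ID in the snapshot summary) rather than committed as
        // the current table state.
        spark.conf().set("spark.wap.id", "etl-2019-11-08");  // hypothetical ID

        df.write()
            .format("iceberg")
            .mode(SaveMode.Append)
            .save(tableName);
      }
    }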
