Re: [DISCUSS] Write-audit-publish support

Ryan Blue Mon, 11 Nov 2019 11:55:28 -0800

I just had a direct request for this over the weekend, too. I opened #629
Add cherry-pick operation
<https://github.com/apache/incubator-iceberg/issues/629> to track this.


On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <aokolnyc...@apple.com>
wrote:

> We would be interested in this functionality as well. We have a use case
> with multiple concurrent writers where we wanted to use WAP but couldn’t.
>
> On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Right now, there isn't a good way to manage multiple pending writes.
> Snapshots from each write are created based on the current table state, so
> simply moving to one of two pending commits would mean you ignore the
> changes in the other pending commit. We've considered adding a
> "cherry-pick" operation that can take the changes from one snapshot and
> apply them on top of another to solve that problem. If you'd like to
> implement that, I'd be happy to review it!
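A minimal sketch of the problem and of what a cherry-pick would do, using a toy in-memory model rather than Iceberg's API. The class CherryPickSketch, the Changes type, and the file names are all hypothetical: a table state is modeled as a plain set of file names, and a pending write as the files it adds and deletes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model, NOT Iceberg's API: a table state is a set of file names and
// a pending write is the set of files it adds and the set it deletes.
final class CherryPickSketch {

    // Hypothetical holder for the changes made by one staged write.
    static final class Changes {
        final Set<String> added;
        final Set<String> deleted;

        Changes(Set<String> added, Set<String> deleted) {
            this.added = added;
            this.deleted = deleted;
        }
    }

    // Apply one write's changes on top of an arbitrary table state.
    static Set<String> apply(Set<String> state, Changes changes) {
        Set<String> result = new LinkedHashSet<>(state);
        result.removeAll(changes.deleted);
        result.addAll(changes.added);
        return result;
    }

    public static void main(String[] args) {
        Set<String> base = Set.of("a.parquet");
        Changes first = new Changes(Set.of("b.parquet"), Set.of());
        Changes second = new Changes(Set.of("c.parquet"), Set.of());

        // Both pending snapshots were created from the same base state.
        Set<String> staged1 = apply(base, first);
        Set<String> staged2 = apply(base, second);

        // Publishing staged1 and then moving to staged2 loses b.parquet.
        System.out.println(staged2.contains("b.parquet")); // prints false

        // Re-applying the second write on top of staged1 keeps both writes.
        System.out.println(apply(staged1, second).size()); // prints 3
    }
}
```

The point of the sketch is only that a staged snapshot records a full state derived from its parent, so publishing a second pending write safely means re-applying its changes to the new current state.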
>
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com>
> wrote:
>
>> Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user
>> can stage multiple WAP snapshots and commit them in any order, based on
>> how the audit process works out.
>> I wonder whether this expectation goes against the underlying principles
>> of Iceberg.
>>
>> Thanks,
>> Ashish
>>
>> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Ashish, you can use the rollback table operation to set a particular
>>> snapshot as the current table state. Like this:
>>>
>>> Table table = hiveCatalog.load(name);
>>> table.rollback().toSnapshotId(id).commit();
>>>
>>>
>>> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Can you please point me to the doc where I can find how to publish a
>>>> WAP snapshot? I am able to filter the snapshots based on wap.id in the
>>>> snapshot summary, but I can't find the official recommendation for
>>>> committing that snapshot. I could cherry-pick the appended/deleted
>>>> files, but I'm not sure whether that would miss something important.
>>>>
>>>> Thanks,
>>>> -Ashish
>>>>
>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Ryan Blue <rb...@netflix.com.invalid>
>>>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>>>> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
>>>>> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <
>>>>> aokolnyc...@apple.com>
>>>>>
>>>>>
>>>>> Hi everyone, I've added PR #342
>>>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>>>> repository with our WAP changes. Please have a look if you were interested
>>>>> in this.
>>>>>
>>>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>>>> edgar.rodrig...@airbnb.com> wrote:
>>>>>
>>>>>> I think this use case is pretty helpful in most data environments. We
>>>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>>>> One question: if the audit part fails, is there a way to expire the
>>>>>> snapshot, or what would be the workflow that follows?
>>>>>>
>>>>>> Best,
>>>>>> Edgar
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>>>> moulimukher...@gmail.com> wrote:
>>>>>>
>>>>>>> This would be super helpful. We have a similar workflow where we do
>>>>>>> some validation before letting the downstream consume the changes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Mouli
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>>>>>>
>>>>>>>> This definitely sounds interesting. Quick question: does this have
>>>>>>>> an impact on the current Upserts spec? Or are we looking to
>>>>>>>> associate this support with append-only commits?
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <
>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>>>> to read the WAP snapshot, even though it has not (yet) been the
>>>>>>>>> current table state. This is documented in the time travel
>>>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>>>> Iceberg site.
>>>>>>>>>
>>>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>>>> snapshot to table metadata, but does not make it the current table
>>>>>>>>> state. That is called by the Spark writer when there is a WAP ID,
>>>>>>>>> and that ID is embedded in the staged snapshot’s metadata so
>>>>>>>>> processes can find it.
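A rough sketch of how an audit process might locate a staged snapshot by its WAP ID. It assumes the ID is stored under the wap.id key of the snapshot summary, as mentioned elsewhere in this thread. To stay self-contained it works on a plain map from snapshot ID to summary; with Iceberg that map could presumably be built from table.snapshots() using snapshotId() and summary(). The class and method names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy lookup, NOT part of Iceberg: finds a staged snapshot by WAP ID.
final class WapLookupSketch {

    // Summary key assumed to carry the WAP ID.
    static final String WAP_ID_KEY = "wap.id";

    // summaries maps snapshot ID -> snapshot summary properties.
    static Optional<Long> findStagedSnapshot(
            Map<Long, Map<String, String>> summaries, String wapId) {
        return summaries.entrySet().stream()
                .filter(e -> wapId.equals(e.getValue().get(WAP_ID_KEY)))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        Map<Long, Map<String, String>> summaries = new LinkedHashMap<>();
        summaries.put(1L, Map.of("operation", "append"));
        summaries.put(2L, Map.of("operation", "append", WAP_ID_KEY, "job-42"));

        // The returned ID is what an audit job would pass as the
        // snapshot-id read option to read the staged data.
        System.out.println(findStagedSnapshot(summaries, "job-42")); // prints Optional[2]
    }
}
```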
>>>>>>>>>
>>>>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would also support adding this to Iceberg itself. I think we
>>>>>>>>>> have a use case where we can leverage this.
>>>>>>>>>>
>>>>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I think this could be useful. When we ingest data from Kafka, we
>>>>>>>>>> do a predefined set of checks on the data. We can potentially utilize
>>>>>>>>>> something like this to check for sanity before publishing.
>>>>>>>>>>
>>>>>>>>>> How is the auditing process supposed to find the new snapshot,
>>>>>>>>>> since it is not accessible from the table? Is it by convention?
>>>>>>>>>>
>>>>>>>>>> -R
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <
>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we
>>>>>>>>>>> write data, then audit the result before publishing the data that
>>>>>>>>>>> was written to a final table. We call this WAP for write, audit,
>>>>>>>>>>> publish.
>>>>>>>>>>>
>>>>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>>>>> new table snapshot, but doesn’t make that snapshot the current
>>>>>>>>>>> version of the table. Instead, a separate process audits the new
>>>>>>>>>>> snapshot and updates the table’s current snapshot when the audits
>>>>>>>>>>> succeed. I wasn’t sure that this would be useful anywhere else
>>>>>>>>>>> until we talked to another company this week that is interested
>>>>>>>>>>> in the same thing. So I wanted to check whether this is a good
>>>>>>>>>>> feature to include in Iceberg itself.
>>>>>>>>>>>
>>>>>>>>>>> This works by staging a snapshot. Basically, Spark writes data
>>>>>>>>>>> as expected, but Iceberg detects that it should not update the
>>>>>>>>>>> table’s current state. That happens when there is a Spark
>>>>>>>>>>> property, spark.wap.id, that indicates the job is a WAP job. Then
>>>>>>>>>>> any table that has WAP enabled by the table property
>>>>>>>>>>> write.wap.enabled=true will stage the new snapshot instead of
>>>>>>>>>>> fully committing, with the WAP ID in the snapshot’s metadata.
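The staging decision described above can be sketched as a small predicate. This is an illustration of the stated rule only, not the code in the Netflix branch: the class and method names are hypothetical, both configurations are modeled as plain string maps, and treating a missing write.wap.enabled as false is an assumption.

```java
import java.util.Map;

// Illustrative predicate, NOT Iceberg code: decides whether a write
// should be staged instead of becoming the current table state.
final class WapDecisionSketch {

    static boolean shouldStage(Map<String, String> sparkConf,
                               Map<String, String> tableProps) {
        // The job is a WAP job when the Spark property spark.wap.id is set.
        boolean hasWapId = sparkConf.get("spark.wap.id") != null;
        // Assumption: a table without the property is treated as WAP-disabled.
        boolean wapEnabled = Boolean.parseBoolean(
                tableProps.getOrDefault("write.wap.enabled", "false"));
        return hasWapId && wapEnabled;
    }

    public static void main(String[] args) {
        Map<String, String> job = Map.of("spark.wap.id", "job-42");
        Map<String, String> table = Map.of("write.wap.enabled", "true");

        System.out.println(shouldStage(job, table));      // prints true
        System.out.println(shouldStage(Map.of(), table)); // prints false
    }
}
```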
>>>>>>>>>>>
>>>>>>>>>>> Is this something we should open a PR to add to Iceberg? It
>>>>>>>>>>> seems a little strange to make it appear that a commit has
>>>>>>>>>>> succeeded, but not actually change a table, which is why we
>>>>>>>>>>> didn’t submit it before now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Filip Bocse
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Edgar Rodriguez
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
