Hi Ryan,

Could you please point me to the doc describing how to publish a WAP snapshot? I am able to filter for the snapshot based on the wap.id in the snapshot's summary, but I can't find the official recommendation on committing that snapshot. I can think of cherry-picking the appended/deleted files myself, but I don't know whether I'd be missing something important with that approach.
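[Editor's note: a minimal sketch of the publish step being asked about here, assuming Spark SQL with Iceberg's stored-procedure extensions available. The catalog/table names and WAP ID are placeholders, not from the thread; the helpers only assemble SQL text, which a real job would pass to spark.sql(...).]

```python
# Hedged sketch: locating a staged WAP snapshot and publishing it by
# cherry-picking, rather than cherry-picking individual data files.
# `db.tbl`, `cat`, and the WAP ID are hypothetical placeholders.

def find_staged_snapshot_sql(table, wap_id):
    # The snapshots metadata table exposes each snapshot's summary map;
    # the Spark writer records the wap.id there for staged snapshots.
    return (
        f"SELECT snapshot_id FROM {table}.snapshots "
        f"WHERE summary['wap.id'] = '{wap_id}'"
    )

def publish_snapshot_sql(catalog, table, snapshot_id):
    # cherrypick_snapshot makes the staged snapshot's changes the table's
    # new current state, completing the "publish" step of WAP.
    return f"CALL {catalog}.system.cherrypick_snapshot('{table}', {snapshot_id})"
```

In a live session this would look like `spark.sql(publish_snapshot_sql("cat", "db.tbl", 12345))`. Cherry-picking at the snapshot level (also available through the Java API as `table.manageSnapshots().cherrypick(id).commit()`) avoids hand-tracking appended/deleted files, since Iceberg applies the staged changes as a single atomic commit.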
Thanks,
-Ashish

> ---------- Forwarded message ---------
> From: Ryan Blue <[email protected]>
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <[email protected]>
> Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>
>
> Hi everyone, I've added PR #342
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
> repository with our WAP changes. Please have a look if you're interested
> in this.
>
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:
>
>> I think this use case is pretty helpful in most data environments; we do
>> the same sort of stage-check-publish pattern to run quality checks.
>> One question: if, say, the audit part fails, is there a way to expire
>> the snapshot, or what would the workflow that follows look like?
>>
>> Best,
>> Edgar
>>
>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:
>>
>>> This would be super helpful. We have a similar workflow where we do some
>>> validation before letting the downstream consume the changes.
>>>
>>> Best,
>>> Mouli
>>>
>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:
>>>
>>>> This definitely sounds interesting. A quick question: does this have
>>>> any impact on the current upserts spec? Or are we looking to associate
>>>> this support with append-only commits?
>>>>
>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>> read the WAP snapshot, even though it has not (yet) become the current
>>>>> table state. This is documented in the time travel
>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>>> site.
>>>>>
>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>> embedded in the staged snapshot's metadata so processes can find it.
>>>>>
>>>>> I'll add a PR with this code, since there is interest.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
>>>>>
>>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>>> use case where we can leverage this.
>>>>>>
>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>>>>>>
>>>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>>>> predefined set of checks on the data. We can potentially utilize
>>>>>> something like this to check for sanity before publishing.
>>>>>>
>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>
>>>>>> -R
>>>>>>
>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>> data, then audit the result before publishing the data that was written
>>>>>>> to a final table. We call this WAP, for write, audit, publish.
>>>>>>>
>>>>>>> We've added support in our Iceberg branch. A WAP write creates a new
>>>>>>> table snapshot, but doesn't make that snapshot the current version of
>>>>>>> the table. Instead, a separate process audits the new snapshot and
>>>>>>> updates the table's current snapshot when the audits succeed. I wasn't
>>>>>>> sure that this would be useful anywhere else until we talked to another
>>>>>>> company this week that is interested in the same thing. So I wanted to
>>>>>>> check whether this is a good feature to include in Iceberg itself.
>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>> expected, but Iceberg detects that it should not update the table's
>>>>>>> current state. That happens when there is a Spark property, spark.wap.id,
>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled
>>>>>>> by the table property write.wap.enabled=true will stage the new
>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot's
>>>>>>> metadata.
>>>>>>>
>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>>> actually change a table, which is why we didn't submit it before now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>
>>>> --
>>>> Filip Bocse
>>
>> --
>> Edgar Rodriguez
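[Editor's note: pulling the thread together, the write-and-audit half of the workflow Ryan describes can be sketched as a sequence of Spark SQL / session-config statements. This is an illustrative sketch, not from the thread; the table name, WAP ID, and source are placeholders, and the helper only assembles the statement strings that a real job would run with spark.sql(...).]

```python
# Hedged sketch of the write-and-audit half of WAP as Spark SQL /
# session-config statements. All names (db.tbl, staging_source, the
# WAP ID) are hypothetical placeholders.

def wap_write_and_audit_steps(table, wap_id):
    return [
        # 1. Opt the table in to WAP so writes are staged, not published.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('write.wap.enabled'='true')",
        # 2. Mark the session as a WAP job; the writer records this ID
        #    in the staged snapshot's summary.
        f"SET spark.wap.id = {wap_id}",
        # 3. Write as usual; with the two settings above, the snapshot is
        #    added to table metadata but not made current.
        f"INSERT INTO {table} SELECT * FROM staging_source",
        # 4. Locate the staged snapshot for auditing via the snapshots
        #    metadata table and its summary map.
        f"SELECT snapshot_id FROM {table}.snapshots "
        f"WHERE summary['wap.id'] = '{wap_id}'",
    ]
```

An audit process would then read the staged snapshot with the snapshot-id read option (the time-travel mechanism Ryan mentions) and, on success, publish it with the cherrypick_snapshot procedure. On failure, which answers Edgar's question as far as I can tell, the staged snapshot is simply never published and can be cleaned up later through normal snapshot expiration.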
