This would be super helpful. We have a similar workflow where we run some validation before letting downstream consumers pick up the changes.
Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:

> This definitely sounds interesting. Quick question: does this have any
> impact on the current Upserts spec? Or are we looking to tie this support
> to append-only commits?
>
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Audits run on the snapshot by setting the snapshot-id read option to
>> read the WAP snapshot, even though it is not (yet) the current table
>> state. This is documented in the time travel
>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>> site.
>>
>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>> to table metadata, but does not make it the current table state. That is
>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>> in the staged snapshot’s metadata so processes can find it.
>>
>> I'll add a PR with this code, since there is interest.
>>
>> rb
>>
>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com>
>> wrote:
>>
>>> I would also support adding this to Iceberg itself. I think we have a
>>> use case where we can leverage this.
>>>
>>> @Ryan, could you also provide more info on the audit process?
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>
>>> I think this could be useful. When we ingest data from Kafka, we run a
>>> predefined set of checks on the data. We could potentially use something
>>> like this as a sanity check before publishing.
>>>
>>> How is the auditing process supposed to find the new snapshot, since it
>>> is not accessible from the table? Is it by convention?
>>>
>>> -R
>>>
>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>> data, then audit the result before publishing the data that was written
>>>> to a final table. We call this WAP, for write, audit, publish.
>>>>
>>>> We’ve added support for this in our Iceberg branch. A WAP write creates
>>>> a new table snapshot, but doesn’t make that snapshot the current version
>>>> of the table. Instead, a separate process audits the new snapshot and
>>>> updates the table’s current snapshot when the audits succeed. I wasn’t
>>>> sure that this would be useful anywhere else until we talked to another
>>>> company this week that is interested in the same thing, so I wanted to
>>>> check whether this is a good feature to include in Iceberg itself.
>>>>
>>>> This works by staging a snapshot. Spark writes data as expected, but
>>>> Iceberg detects that it should not update the table’s current state.
>>>> That happens when a Spark property, spark.wap.id, indicates the job is
>>>> a WAP job. Any table that has WAP enabled by the table property
>>>> write.wap.enabled=true will then stage the new snapshot instead of
>>>> fully committing, with the WAP ID in the snapshot’s metadata.
>>>>
>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>> little strange to make it appear that a commit has succeeded without
>>>> actually changing the table, which is why we didn’t submit it before
>>>> now.
>>>>
>>>> Thanks,
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Filip Bocse
>
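
To make the flow described in the thread concrete, here is a minimal
write-side sketch in Spark/Scala. It assumes a SparkSession named spark,
a DataFrame named df, and an already-loaded Iceberg Table named table;
the table location and WAP ID value are placeholders, and only the
write.wap.enabled and spark.wap.id properties come from the thread itself.

    // Write side: stage a snapshot instead of publishing it.
    // Assumes SparkSession `spark`, DataFrame `df`, Iceberg Table `table`.

    // One-time setup: opt the table into WAP behavior.
    table.updateProperties()
      .set("write.wap.enabled", "true")
      .commit()

    // Tag this job as a WAP write; the ID is any string the audit can
    // look up later.
    spark.conf.set("spark.wap.id", "nightly-etl-2019-07-22")

    // The write itself is unchanged; with the property set, Iceberg stages
    // the snapshot rather than making it the current table state.
    df.write
      .format("iceberg")
      .mode("append")
      .save("hdfs://nn:8020/warehouse/db/table")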
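
The audit step then locates the staged snapshot and reads it with the
snapshot-id time-travel option Ryan mentions. This is a sketch that assumes
the WAP ID is stored under a wap.id key in the snapshot summary (the thread
only says the ID is embedded in the snapshot's metadata, so the key name is
a guess) and that the table can be loaded via HadoopTables:

    import scala.collection.JavaConverters._
    import org.apache.iceberg.hadoop.HadoopTables

    // Load the table directly so we can see snapshots that are not current.
    val table = new HadoopTables(spark.sparkContext.hadoopConfiguration)
      .load("hdfs://nn:8020/warehouse/db/table")  // illustrative location

    // Find the staged snapshot carrying our WAP ID. The "wap.id" summary
    // key is an assumed convention, not confirmed by the thread.
    val staged = table.snapshots().asScala
      .find(_.summary().asScala.get("wap.id")
        .contains("nightly-etl-2019-07-22"))
      .getOrElse(sys.error("no staged snapshot found for this WAP ID"))

    // Audit via time travel: read the snapshot even though it is not the
    // current table state.
    val audited = spark.read
      .format("iceberg")
      .option("snapshot-id", staged.snapshotId())
      .load("hdfs://nn:8020/warehouse/db/table")

    // Example check: fail the audit if any rows are missing the key column.
    val badRows = audited.filter("id IS NULL").count()
    require(badRows == 0L, s"audit failed: $badRows rows with null id")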
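
Finally, a publish step would promote the audited snapshot to the current
table state. The thread does not name the API for this part; the sketch
below assumes something along the lines of Iceberg's snapshot-management
interface, so treat it as one plausible shape rather than the mechanism the
PR actually adds:

    // Publish: make the audited snapshot the current table state.
    // setCurrentSnapshot on manageSnapshots() is assumed for illustration;
    // the actual publish mechanism is not specified in the thread.
    table.manageSnapshots()
      .setCurrentSnapshot(staged.snapshotId())
      .commit()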