This definitely sounds interesting. Quick question: does this have any impact on the current Upserts spec, or are we looking to associate this support only with append-only commits?
On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:

> Audits run on the snapshot by setting the snapshot-id read option to read
> the WAP snapshot, even though it has not (yet) become the current table
> state. This is documented in the time travel
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
> site.
>
> We added a stageOnly method to SnapshotProducer that adds the snapshot to
> table metadata, but does not make it the current table state. That is
> called by the Spark writer when there is a WAP ID, and that ID is embedded
> in the staged snapshot’s metadata so processes can find it.
>
> I'll add a PR with this code, since there is interest.
>
> rb
>
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
>
>> I would also support adding this to Iceberg itself. I think we have a use
>> case where we can leverage this.
>>
>> @Ryan, could you also provide more info on the audit process?
>>
>> Thanks,
>> Anton
>>
>> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>>
>> I think this could be useful. When we ingest data from Kafka, we run a
>> predefined set of checks on the data. We could potentially use something
>> like this to check for sanity before publishing.
>>
>> How is the auditing process supposed to find the new snapshot, since it
>> is not accessible from the table? Is it by convention?
>>
>> -R
>>
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> At Netflix, we have a pattern for building ETL jobs where we write data,
>>> then audit the result before publishing the data that was written to a
>>> final table. We call this WAP: write, audit, publish.
>>>
>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>> table snapshot, but doesn’t make that snapshot the current version of the
>>> table. Instead, a separate process audits the new snapshot and updates the
>>> table’s current snapshot when the audits succeed. I wasn’t sure this
>>> would be useful anywhere else until we talked to another company this week
>>> that is interested in the same thing. So I wanted to check whether this is
>>> a good feature to include in Iceberg itself.
>>>
>>> This works by staging a snapshot. Basically, Spark writes data as
>>> expected, but Iceberg detects that it should not update the table’s
>>> current state. That happens when there is a Spark property, spark.wap.id,
>>> that indicates the job is a WAP job. Then any table that has WAP enabled
>>> by the table property write.wap.enabled=true will stage the new snapshot
>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>
>>> Is this something we should open a PR to add to Iceberg? It seems a
>>> little strange to make it appear that a commit has succeeded but not
>>> actually change the table, which is why we didn’t submit it before now.
>>>
>>> Thanks,
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Filip Bocse
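[Editor's note: for readers skimming the thread, the two knobs Ryan names translate to settings roughly like the following. This is a hedged sketch of the configuration surface only; exact syntax and placement depend on the Spark and Iceberg versions in use.]

```
# Spark session property marking the job as a WAP job
spark.wap.id=<some-job-identifier>

# Iceberg table property that enables staging instead of committing
write.wap.enabled=true
```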
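[Editor's note: the stage/audit/publish flow described in the thread can be sketched with a small toy model. This is not Iceberg's real API; the names `Table`, `Snapshot`, `stage_only`, `find_staged`, and `publish` are all illustrative, standing in for `SnapshotProducer.stageOnly` and the WAP-ID lookup Ryan describes.]

```python
# Toy model (NOT the Iceberg API) of write-audit-publish staging:
# a staged snapshot is added to table metadata, tagged with a WAP ID,
# without becoming the current table state; a later publish step
# promotes it once audits pass.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    summary: Dict[str, str] = field(default_factory=dict)  # holds the WAP ID


@dataclass
class Table:
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def stage_only(self, snapshot: Snapshot, wap_id: str) -> None:
        # Analogous to stageOnly: record the snapshot in table metadata,
        # embed the WAP ID, but do NOT advance the current table state.
        snapshot.summary["wap.id"] = wap_id
        self.snapshots.append(snapshot)

    def find_staged(self, wap_id: str) -> Optional[Snapshot]:
        # The audit process locates the staged snapshot by its WAP ID.
        for s in self.snapshots:
            if s.summary.get("wap.id") == wap_id:
                return s
        return None

    def publish(self, wap_id: str) -> None:
        # After audits succeed, make the staged snapshot current.
        staged = self.find_staged(wap_id)
        if staged is None:
            raise ValueError(f"no staged snapshot for WAP ID {wap_id!r}")
        self.current_snapshot_id = staged.snapshot_id


table = Table()
table.stage_only(Snapshot(snapshot_id=1), wap_id="job-123")
assert table.current_snapshot_id is None  # staged, not yet current
table.publish("job-123")                  # audits passed
assert table.current_snapshot_id == 1
```

An audit job would read the staged snapshot directly (in the real system, via the `snapshot-id` read option mentioned in the thread) before deciding whether to publish.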
