Re: Iceberg MV Refresh

Walaa Eldin Moustafa Thu, 20 Jun 2024 11:19:35 -0700

Benny, is the suggestion to couple the "refresh-start-timestamp-ms"
property with a grace period as well? Also, could you clarify which
timestamp "refresh-start-timestamp-ms" refers to:
(1) Timestamp when refresh is triggered
(2) Timestamp when refresh is concluded and the snapshot is written.


Also when you say "fresh as of" this timestamp, do you mean "AS OF"
construct when used to query the materialized view? Or something else? If
"AS OF" is what you meant, then this might answer my question about the
grace period, where it won't be needed.

Thanks,
Walaa.



On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen <piotr.findei...@gmail.com>
wrote:

> Hi Benny,
>
> on the staleness topic I'd recommend to check how Trino implements
> materialized views in Iceberg and how it defines staleness.
> In particular
>
> - a view can have defined grace period which defines how stale the data
> can be for the materialization to be considered useful (defaults to
> unlimited)
> - staleness clock starts with the first table change after refresh
> - for unmanaged (non-iceberg) tables where we don't know when the table
> changed, the staleness clock starts right after refresh
>
> Best
> Piotr
>
>
>
>
>
> On Wed, 19 Jun 2024 at 19:58, Benny Chow <btc...@gmail.com> wrote:
>
>> Hey Guys,
>>
>> Great progress on the MV spec and thanks a ton to Jan and Walaa for
>> driving this.  One of our latest achievements was that we finalized the
>> view lineage and materialization table refresh JSON so that we can
>> definitively and concisely describe what data is in the materialization
>> table.
>>
>> Regarding the actual refresh process, I have two more suggestions:
>>
>> *When should a MV be refreshed?  *There could be many different refresh
>> policies such as "on table data or view change", periodic, scheduled and/or
>> manual with the goal of reducing staleness while minimizing cost to
>> refresh.  I don't think we should try to capture this configuration as part
>> of the first iteration of the MV spec.  So, I suggest we just remove the "
>> *materialization.data.max-staleness*" view property for now.  There's a
>> lot of comments on this in the spec and many contributors did suggest to
>> not include it as well.
>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA
>>
>> *When refresh is done, what information is stored to help engines
>> evaluate materialization freshness?*  We agreed on storing the view
>> lineage and materialization refresh-tables so that engines can query for
>> the current table snapshot versions and view versions and compare to what
>> is stored in the refresh-tables.  However, there could be 100s of
>> tables/views here and it could be prohibitively expensive to do this
>> check.  Instead, the engine may just use the materialization's snapshot
>> summary timestamp-ms to determine the last refresh time and assume the data
>> is fresh as of this timestamp.  However, this assumption might be naive if
>> the refresh job took 1 hour to run and source tables were queried at
>> different times throughout the execution of the job.  So, I propose we add
>> a "*refresh-start-timestamp-ms*" to the materialization snapshot summary
>> which tells users that the data in the materialization is at least as fresh
>> as of this date  (It might be fresher but not more stale).
>>
>> Thoughts?
>>
>> Thanks
>> Benny
>>
>>
>>
>>
>>
>

Re: Iceberg MV Refresh

Reply via email to