Re: Iceberg MV Refresh

Daniel Weeks Fri, 21 Jun 2024 08:14:55 -0700

Benny, I think you bring up a good point about staleness in that
different clients may want different behaviors.  However, defining a "grace
period" or "max staleness" is pretty common and makes a lot of sense when
working with expensive queries.  Trino
<https://trino.io/docs/current/sql/create-materialized-view.html#synopsis>
and BigQuery
<https://cloud.google.com/bigquery/docs/materialized-views-create#max_staleness>
(I
found some references to traditional OLTP databases as well) support this
concept and I think it's helpful to standardize because it's a common
pattern especially where clients may not have the ability to refresh or
fallback to the view definition.


Maybe adding the property and noting that it's acceptable for clients to
override the behavior with query specific options?

-Dan



On Thu, Jun 20, 2024 at 9:47 PM Jan Kaul <[email protected]>
wrote:

> Thanks Benny for bringing these issues up. I would agree with both of your
> propositions.
>
> Regarding the naming of the fields, we can go with the naming that you
> suggested. I just wanted to wait if some more people chime in with their
> opinions.
>
> Jan
> On 20.06.24 23:16, Benny Chow wrote:
>
> > So basically this is just FYI and it is up to the consumer to assume
> what to do given the length of time between that timestamp and now?
> Yes.. some consumers will use it.. some won't.   Only the engine producing
> the materialization will know when it started the refresh job.
>
> > Would the decision be on the engine implementation side (which does not
> have the domain knowledge), or would it be on the user's side?
> I think the decision would mostly be on the engine implementation side.  I
> would imagine an engine would have a bunch of configuration properties
> associated with a view that control how/when the refresh is performed and
> settings around when the planner can still use the view's materialization
> when it is stale.  It might be possible for the user to specify hints in
> the query or session/connection properties to control the staleness
> tolerance too.
>
> Benny
>
>
> On Thu, Jun 20, 2024 at 12:17 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
>> So basically this is just FYI and it is up to the consumer to assume what
>> to do given the length of time between that timestamp and now? Would the
>> decision be on the engine implementation side (which does not have the
>> domain knowledge), or would it be on the user's side? In the latter case
>> the user would need to explicitly query the storage table directly,
>> correct? With a grace period I think we could push it down to the engine.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Jun 20, 2024 at 12:06 PM Benny Chow <[email protected]> wrote:
>>
>>> Piotr, thanks for the Trino pointers.  I noticed that Trino stores the
>>> refresh start time as a snapshot summary property here
>>> <https://github.com/trinodb/trino/blob/6697fe24481a30d37eb91efd62666165acf379c2/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L3027>.
>>> I think this is exactly what I am asking for with
>>> "refresh-start-timestamp-ms".
>>>
>>> Walaa, no, the suggestion is to not have a grace period as this is
>>> engine specific on how it wants to handle staleness.
>>>  "refresh-start-timestamp-ms" refers to 1 and not 2.  We should already
>>> have 2 in the snapshot summary timestamp-ms property.
>>>
>>> When I say "fresh as of", I don't mean the AS OF construct.  It's just
>>> making a guarantee to the consumer of the MV that the materialization
>>> contains data that is "fresh as of" a certain timestamp.  So like, if you
>>> built a materialization on top of 100 tables (possibly a mix of Iceberg and
>>> non-Iceberg) and you know that the refresh job ran on say 6/20/2024
>>> 12:02:10 UTC, then whatever data is in the materialization has to be "fresh
>>> as of" 6/20/2024 12:02:10 UTC.
>>>
>>> Thanks
>>> Benny
>>>
>>>
>>>
>>>
>>> On Thu, Jun 20, 2024 at 11:19 AM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> Benny, is the suggestion to couple the "refresh-start-timestamp-ms"
>>>> property with a grace period as well? Also, could you clarify which
>>>> timestamp "refresh-start-timestamp-ms" refers to:
>>>> (1) Timestamp when refresh is triggered
>>>> (2) Timestamp when refresh is concluded and the snapshot is written.
>>>>
>>>> Also when you say "fresh as of" this timestamp, do you mean "AS OF"
>>>> construct when used to query the materialized view? Or something else? If
>>>> "AS OF" is what you meant, then this might answer my question about the
>>>> grace period, where it won't be needed.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>>
>>>> On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Benny,
>>>>>
>>>>> on the staleness topic I'd recommend to check how Trino implements
>>>>> materialized views in Iceberg and how it defines staleness.
>>>>> In particular
>>>>>
>>>>> - a view can have defined grace period which defines how stale the
>>>>> data can be for the materialization to be considered useful (defaults to
>>>>> unlimited)
>>>>> - staleness clock starts with the first table change after refresh
>>>>> - for unmanaged (non-iceberg) tables where we don't know when the
>>>>> table changed, the staleness clock starts right after refresh
>>>>>
>>>>> Best
>>>>> Piotr
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 19 Jun 2024 at 19:58, Benny Chow <[email protected]> wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> Great progress on the MV spec and thanks a ton to Jan and Walaa for
>>>>>> driving this.  One of our latest achievements was that we finalized the
>>>>>> view lineage and materialization table refresh JSON so that we can
>>>>>> definitively and concisely describe what data is in the materialization
>>>>>> table.
>>>>>>
>>>>>> Regarding the actual refresh process, I have two more suggestions:
>>>>>>
>>>>>> *When should a MV be refreshed?  *There could be many different
>>>>>> refresh policies such as "on table data or view change", periodic,
>>>>>> scheduled and/or manual with the goal of reducing staleness while
>>>>>> minimizing cost to refresh.  I don't think we should try to capture this
>>>>>> configuration as part of the first iteration of the MV spec.  So, I 
>>>>>> suggest
>>>>>> we just remove the "*materialization.data.max-staleness*" view
>>>>>> property for now.  There's a lot of comments on this in the spec and many
>>>>>> contributors did suggest to not include it as well.
>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA
>>>>>>
>>>>>> *When refresh is done, what information is stored to help engines
>>>>>> evaluate materialization freshness?*  We agreed on storing the view
>>>>>> lineage and materialization refresh-tables so that engines can query for
>>>>>> the current table snapshot versions and view versions and compare to what
>>>>>> is stored in the refresh-tables.  However, there could be 100s of
>>>>>> tables/views here and it could be prohibitively expensive to do this
>>>>>> check.  Instead, the engine may just use the materialization's snapshot
>>>>>> summary timestamp-ms to determine the last refresh time and assume the 
>>>>>> data
>>>>>> is fresh as of this timestamp.  However, this assumption might be naive 
>>>>>> if
>>>>>> the refresh job took 1 hour to run and source tables were queried at
>>>>>> different times throughout the execution of the job.  So, I propose we 
>>>>>> add
>>>>>> a "*refresh-start-timestamp-ms*" to the materialization snapshot
>>>>>> summary which tells users that the data in the materialization is at 
>>>>>> least
>>>>>> as fresh as of this date  (It might be fresher but not more stale).
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Thanks
>>>>>> Benny
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>

Re: Iceberg MV Refresh

Reply via email to