Thanks Benny for bringing these issues up. I would agree with both of
your propositions.
Regarding the naming of the fields, we can go with the naming that you
suggested. I just wanted to wait if some more people chime in with their
opinions.
Jan
On 20.06.24 23:16, Benny Chow wrote:
> So basically this is just FYI and it is up to the consumer to assume
what to do given the length of time between that timestamp and now?
Yes.. some consumers will use it.. some won't. Only the engine
producing the materialization will know when it started the refresh job.
> Would the decision be on the engine implementation side (which does
not have the domain knowledge), or would it be on the user's side?
I think the decision would mostly be on the engine implementation
side. I would imagine an engine would have a bunch of configuration
properties associated with a view that control how/when the refresh is
performed and settings around when the planner can still use the
view's materialization when it is stale. It might be possible for the
user to specify hints in the query or session/connection properties to
control the staleness tolerance too.
Benny
On Thu, Jun 20, 2024 at 12:17 PM Walaa Eldin Moustafa
<[email protected]> wrote:
So basically this is just FYI and it is up to the consumer to
assume what to do given the length of time between that timestamp
and now? Would the decision be on the engine implementation side
(which does not have the domain knowledge), or would it be on the
user's side? In the latter case the user would need to explicitly
query the storage table directly, correct? With a grace period I
think we could push it down to the engine.
Thanks,
Walaa.
On Thu, Jun 20, 2024 at 12:06 PM Benny Chow <[email protected]> wrote:
Piotr, thanks for the Trino pointers. I noticed that Trino
stores the refresh start time as a snapshot summary property
here
<https://github.com/trinodb/trino/blob/6697fe24481a30d37eb91efd62666165acf379c2/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L3027>.
I think this is exactly what I am asking for with
"refresh-start-timestamp-ms".
Walaa, no, the suggestion is to not have a grace period as
this is engine specific on how it wants to handle staleness.
"refresh-start-timestamp-ms" refers to 1 and not 2. We
should already have 2 in the snapshot summary timestamp-ms
property.
When I say "fresh as of", I don't mean the AS OF construct.
It's just making a guarantee to the consumer of the MV that
the materialization contains data that is "fresh as of" a
certain timestamp. So like, if you built a materialization on
top of 100 tables (possibly a mix of Iceberg and non-Iceberg)
and you know that the refresh job ran on say 6/20/2024
12:02:10 UTC, then whatever data is in the materialization has
to be "fresh as of" 6/20/2024 12:02:10 UTC.
Thanks
Benny
On Thu, Jun 20, 2024 at 11:19 AM Walaa Eldin Moustafa
<[email protected]> wrote:
Benny, is the suggestion to couple the
"refresh-start-timestamp-ms" property with a grace period
as well? Also, could you clarify which timestamp
"refresh-start-timestamp-ms" refers to:
(1) Timestamp when refresh is triggered
(2) Timestamp when refresh is concluded and the snapshot
is written.
Also when you say "fresh as of" this timestamp, do you
mean "AS OF" construct when used to query the materialized
view? Or something else? If "AS OF" is what you meant,
then this might answer my question about the grace period,
where it won't be needed.
Thanks,
Walaa.
On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen
<[email protected]> wrote:
Hi Benny,
on the staleness topic I'd recommend to check how
Trino implements materialized views in Iceberg and how
it defines staleness.
In particular
- a view can have defined grace period which defines
how stale the data can be for the materialization to
be considered useful (defaults to unlimited)
- staleness clock starts with the first table change
after refresh
- for unmanaged (non-iceberg) tables where we don't
know when the table changed, the staleness clock
starts right after refresh
Best
Piotr
On Wed, 19 Jun 2024 at 19:58, Benny Chow
<[email protected]> wrote:
Hey Guys,
Great progress on the MV spec and thanks a ton to
Jan and Walaa for driving this. One of our latest
achievements was that we finalized the view
lineage and materialization table refresh JSON so
that we can definitively and concisely describe
what data is in the materialization table.
Regarding the actual refresh process, I have two
more suggestions:
*
*
*_When should a MV be refreshed?_ *There could be
many different refresh policies such as "on table
data or view change", periodic, scheduled and/or
manual with the goal of reducing staleness while
minimizing cost to refresh. I don't think we
should try to capture this configuration as part
of the first iteration of the MV spec. So, I
suggest we just remove the
"*/materialization.data.max-staleness/*" view
property for now. There's a lot of comments on
this in the spec and many contributors did suggest
to not include it as well.
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA
_When refresh is done, what information is stored
to help engines evaluate materialization
freshness?_ We agreed on storing the view lineage
and materialization refresh-tables so that engines
can query for the current table snapshot versions
and view versions and compare to what is stored in
the refresh-tables. However, there could be 100s
of tables/views here and it could be prohibitively
expensive to do this check. Instead, the engine
may just use the materialization's snapshot
summary timestamp-ms to determine the last refresh
time and assume the data is fresh as of this
timestamp. However, this assumption might be
naive if the refresh job took 1 hour to run and
source tables were queried at different times
throughout the execution of the job. So, I
propose we add a "*/refresh-start-timestamp-ms/*"
to the materialization snapshot summary which
tells users that the data in the materialization
is at least as fresh as of this date (It might be
fresher but not more stale).
Thoughts?
Thanks
Benny