Hey Guys,

Great progress on the MV spec and thanks a ton to Jan and Walaa for
driving this.  One of our latest achievements was that we finalized the
view lineage and materialization table refresh JSON so that we can
definitively and concisely describe what data is in the materialization
table.

Regarding the actual refresh process, I have two more suggestions:

*When should an MV be refreshed?*  There could be many different refresh
policies, such as "on table data or view change", periodic, scheduled,
and/or manual, with the goal of reducing staleness while minimizing the
cost of refreshing.  I don't think we should try to capture this
configuration in the first iteration of the MV spec.  So, I suggest we
just remove the "*materialization.data.max-staleness*" view property for
now.  There are a lot of comments on this in the spec, and many
contributors also suggested not including it:
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA

*When a refresh is done, what information is stored to help engines
evaluate materialization freshness?*  We agreed on storing the view lineage
and materialization refresh-tables so that engines can query for the current
table snapshot versions and view versions and compare them to what is stored
in the refresh-tables.  However, there could be hundreds of tables/views
here, and it could be prohibitively expensive to do this check.  Instead,
the engine may just use the materialization's snapshot summary timestamp-ms
to determine the last refresh time and assume the data is fresh as of this
timestamp.  However, this assumption might be naive if the refresh job took
1 hour to run and source tables were queried at different times throughout
the execution of the job.  So, I propose we add a
"*refresh-start-timestamp-ms*" to the materialization snapshot summary,
which tells users that the data in the materialization is at least as fresh
as this timestamp (it might be fresher, but not more stale).
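
To make the trade-off concrete, here is a rough Python sketch of the two
checks an engine could run.  It is only an illustration: the function names
and the flat "name -> version" maps are hypothetical and do not reflect the
actual refresh-tables JSON layout, and the summary is assumed to be a plain
string-to-string map carrying the proposed "refresh-start-timestamp-ms" key.

```python
from typing import Mapping, Optional


def is_fresh_by_lineage(
    refreshed: Mapping[str, int], current: Mapping[str, int]
) -> bool:
    """Exact but potentially expensive check.

    `refreshed` is a hypothetical map of source table/view name to the
    version recorded at refresh time; `current` holds the versions the
    engine just looked up. Every source must still be at its recorded
    version for the materialization to count as fresh.
    """
    return all(current.get(name) == version for name, version in refreshed.items())


def staleness_upper_bound_ms(
    summary: Mapping[str, str], now_ms: int
) -> Optional[int]:
    """Cheap check based on the proposed summary field.

    Sources were read at or after the refresh start, so the data is at
    most (now - refresh start) stale. Returns None when the field is
    absent, in which case the engine must fall back to another check.
    """
    start = summary.get("refresh-start-timestamp-ms")
    if start is None:
        return None
    return now_ms - int(start)


# Usage with made-up values: a refresh that started one hour before "now".
summary = {"refresh-start-timestamp-ms": "1700000000000"}
bound = staleness_upper_bound_ms(summary, now_ms=1700003600000)
fresh = is_fresh_by_lineage({"db.orders": 7}, {"db.orders": 8})
```

The cheap check only reads one summary entry, while the lineage check needs
one lookup per source table or view, which is where the cost comes from when
there are hundreds of them.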

Thoughts?

Thanks
Benny
