Re: Iceberg MV Refresh

Jan Kaul Thu, 20 Jun 2024 21:47:30 -0700

Thanks Benny for bringing these issues up. I would agree with both ofyour propositions.

Regarding the naming of the fields, we can go with the naming that yousuggested. I just wanted to wait if some more people chime in with theiropinions.


Jan

On 20.06.24 23:16, Benny Chow wrote:

> So basically this is just FYI and it is up to the consumer to assumewhat to do given the length of time between that timestamp and now?Yes.. some consumers will use it.. some won't. Only the engineproducing the materialization will know when it started the refresh job.

> Would the decision be on the engine implementation side (which doesnot have the domain knowledge), or would it be on the user's side?I think the decision would mostly be on the engine implementationside. I would imagine an engine would have a bunch of configurationproperties associated with a view that control how/when the refresh isperformed and settings around when the planner can still use theview's materialization when it is stale. It might be possible for theuser to specify hints in the query or session/connection properties tocontrol the staleness tolerance too.


Benny

On Thu, Jun 20, 2024 at 12:17 PM Walaa Eldin Moustafa<wa.moust...@gmail.com> wrote:


    So basically this is just FYI and it is up to the consumer to
    assume what to do given the length of time between that timestamp
    and now? Would the decision be on the engine implementation side
    (which does not have the domain knowledge), or would it be on the
    user's side? In the latter case the user would need to explicitly
    query the storage table directly, correct? With a grace period I
    think we could push it down to the engine.

    Thanks,
    Walaa.


    On Thu, Jun 20, 2024 at 12:06 PM Benny Chow <btc...@gmail.com> wrote:

        Piotr, thanks for the Trino pointers.  I noticed that Trino
        stores the refresh start time as a snapshot summary property
        here
        
<https://github.com/trinodb/trino/blob/6697fe24481a30d37eb91efd62666165acf379c2/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L3027>.
 
        I think this is exactly what I am asking for with
        "refresh-start-timestamp-ms".

        Walaa, no, the suggestion is to not have a grace period as
        this is engine specific on how it wants to handle staleness. 
         "refresh-start-timestamp-ms" refers to 1 and not 2.  We
        should already have 2 in the snapshot summary timestamp-ms
        property.

        When I say "fresh as of", I don't mean the AS OF construct. 
        It's just making a guarantee to the consumer of the MV that
        the materialization contains data that is "fresh as of" a
        certain timestamp.  So like, if you built a materialization on
        top of 100 tables (possibly a mix of Iceberg and non-Iceberg)
        and you know that the refresh job ran on say 6/20/2024
        12:02:10 UTC, then whatever data is in the materialization has
        to be "fresh as of" 6/20/2024 12:02:10 UTC.

        Thanks
        Benny




        On Thu, Jun 20, 2024 at 11:19 AM Walaa Eldin Moustafa
        <wa.moust...@gmail.com> wrote:

            Benny, is the suggestion to couple the
            "refresh-start-timestamp-ms" property with a grace period
            as well? Also, could you clarify which timestamp
            "refresh-start-timestamp-ms" refers to:
            (1) Timestamp when refresh is triggered
            (2) Timestamp when refresh is concluded and the snapshot
            is written.

            Also when you say "fresh as of" this timestamp, do you
            mean "AS OF" construct when used to query the materialized
            view? Or something else? If "AS OF" is what you meant,
            then this might answer my question about the grace period,
            where it won't be needed.

            Thanks,
            Walaa.



            On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen
            <piotr.findei...@gmail.com> wrote:

                Hi Benny,

                on the staleness topic I'd recommend to check how
                Trino implements materialized views in Iceberg and how
                it defines staleness.
                In particular

                - a view can have defined grace period which defines
                how stale the data can be for the materialization to
                be considered useful (defaults to unlimited)
                - staleness clock starts with the first table change
                after refresh
                - for unmanaged (non-iceberg) tables where we don't
                know when the table changed, the staleness clock
                starts right after refresh

                Best
                Piotr





                On Wed, 19 Jun 2024 at 19:58, Benny Chow
                <btc...@gmail.com> wrote:

                    Hey Guys,

                    Great progress on the MV spec and thanks a ton to
                    Jan and Walaa for driving this.  One of our latest
                    achievements was that we finalized the view
                    lineage and materialization table refresh JSON so
                    that we can definitively and concisely describe
                    what data is in the materialization table.

                    Regarding the actual refresh process, I have two
                    more suggestions:
                    *
                    *
                    *_When should a MV be refreshed?_ *There could be
                    many different refresh policies such as "on table
                    data or view change", periodic, scheduled and/or
                    manual with the goal of reducing staleness while
                    minimizing cost to refresh.  I don't think we
                    should try to capture this configuration as part
                    of the first iteration of the MV spec.  So, I
                    suggest we just remove the
                    "*/materialization.data.max-staleness/*" view
                    property for now.  There's a lot of comments on
                    this in the spec and many contributors did suggest
                    to not include it as well.
                    
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA

                    _When refresh is done, what information is stored
                    to help engines evaluate materialization
                    freshness?_  We agreed on storing the view lineage
                    and materialization refresh-tables so that engines
                    can query for the current table snapshot versions
                    and view versions and compare to what is stored in
                    the refresh-tables.  However, there could be 100s
                    of tables/views here and it could be prohibitively
                    expensive to do this check.  Instead, the engine
                    may just use the materialization's snapshot
                    summary timestamp-ms to determine the last refresh
                    time and assume the data is fresh as of this
                    timestamp.  However, this assumption might be
                    naive if the refresh job took 1 hour to run and
                    source tables were queried at different times
                    throughout the execution of the job.  So, I
                    propose we add a "*/refresh-start-timestamp-ms/*"
                    to the materialization snapshot summary which
                    tells users that the data in the materialization
                    is at least as fresh as of this date  (It might be
                    fresher but not more stale).

                    Thoughts?

                    Thanks
                    Benny

Re: Iceberg MV Refresh

Reply via email to