Looks good, Jan. I'm a bit nitpicky about picking good names, so I left some comments around that to see what others think.
Thanks

On Fri, Jun 7, 2024 at 2:26 AM Jan Kaul <[email protected]> wrote:

> Thanks Benny and Walaa for your input. I updated the doc
> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
> to account for the changes as far as I understood. I would appreciate it if
> you had a look and gave me some feedback.
>
> If you have some open comments that are not relevant anymore due to the
> changes, please close them so that we can clean up the comments section a
> bit.
>
> Regards,
>
> Jan
> On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
>
> * lineage state JSON structure
>
> On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
>> Hi Benny,
>>
>> Your understanding is correct.
>>
>> Another point that we discussed was the type of APIs engines can use to
>> conveniently update the storage table with view query results as well as
>> set the snapshot summary on the output snapshot (one that was produced by
>> the update). We will follow up on that separately.
>>
>> Jan, do you want to reflect the lineage + state discussion in the doc
>> so we can iterate on the lineage JSON structure?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <[email protected]> wrote:
>>
>>> I really enjoyed listening to the replay and hearing everyone's
>>> feedback! I'm in agreement with all 3 consensus items, especially around
>>> Dan's idea to separate the view's query tree lineage from the
>>> materialization's lineage state.
>>>
>>> I'll summarize my understanding about the distinction and add a few
>>> comments:
>>>
>>> Materialized View's Query Tree Lineage
>>> - It's basically the SQL representation converted to a distinct list of
>>> tables and views.
>>> - Stored inside view versions so if you change the view SQL, you can
>>> include the lineage with that change.
>>> - Tables support time travel so they can optionally include a ref type
>>> and name/timestamp
>>> - Views would NOT include the version (that's part of the
>>> materialization lineage state below)
>>> - I think we should use fully qualified identifiers here instead of
>>> UUIDs. Dropping and re-creating a referenced table or view doesn't break
>>> the view SQL so the lineage should not be broken either. I also don't
>>> think we can support time travel if we use table UUIDs here.
>>> - Each table or view can be assigned a unique sequence number. This
>>> sequence number is scoped to a single view version.
>>>
>>> Materialization Lineage State
>>> - It's basically a lookup table for the above sequence number to either
>>> a table snapshot id or view version that was used at the time of
>>> creating/refreshing the storage table. For views, these are nested views
>>> within the MV's query tree - not the MV itself.
>>> - Stored inside the table's snapshot summary
>>> - Additional property "refresh-version-id" to identify the MV's version.
>>>
>>> In order to validate the freshness of a materialization, everything
>>> above has to be checked against the latest tables and views. This should
>>> cover all data and query tree changes (that I can think of) such as the
>>> "limit 100" example I gave in Slack
>>> <https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>
>>> .
>>>
>>> Please let me know your thoughts.
>>>
>>> Thanks
>>>
>>> On Thu, Jun 6, 2024 at 7:53 AM <[email protected]> wrote:
>>>
>>>> Thanks for hosting, it was a very helpful meeting. I really hope we can
>>>> do more in the future to accelerate consensus on other proposals.
>>>>
>>>>
>>>> I do encourage anyone on the mailing list to add your comments offline
>>>> as well, especially if you have strong feelings. Iceberg is an open project
>>>> and we realize not everyone can attend virtual meetings and want you to
>>>> know you are welcome.
>>>>
>>>>
>>>>
>>>> On Jun 6, 2024, at 7:11 AM, Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> thanks to all of you who attended the meeting yesterday! It was great
>>>> to talk to you and I think we made great progress. For those of you who
>>>> weren't able to attend the meeting, I summarized the main points below:
>>>>
>>>> *Question 1*: Should we store the "storage table pointer" as a view
>>>> property or as an additional field in the view metadata?
>>>>
>>>> We reached consensus to add a *new metadata field* "storage-table" to
>>>> the view version <https://iceberg.apache.org/view-spec/#versions>
>>>> record that stores the identifier of the storage table. The motivation
>>>> for introducing a new field is that this emphasizes that materialized views
>>>> are part of the standard and it enforces a common behavior.
>>>>
>>>> *Question 2*: Where should the lineage-state information be stored?
>>>>
>>>> We reached consensus on storing the lineage-state information in the
>>>> *snapshot summary* of the storage table. The motivation behind this is
>>>> that the table spec should not be concerned with defining view constructs.
>>>>
>>>> *Question 3*: How should the lineage-state information be represented?
>>>>
>>>> We reached consensus on representing the lineage-state in the form of
>>>> nested objects and storing these as a *JSON-encoded string* inside the
>>>> storage table snapshot summary.
>>>>
>>>> Additionally, Dan proposed to introduce a new lineage construct as part
>>>> of the view definition in addition to the lineage-state that is part of the
>>>> storage table. The idea is to separate the concerns. The lineage-state in
>>>> the storage table should only capture the state of the source tables at the
>>>> time of the last refresh, whereas the lineage information in the view
>>>> contains more information about the source tables and is responsible for
>>>> resolving the identifiers.
>>>> We haven't really decided on how the new lineage
>>>> construct should be represented or integrated into the view metadata.
>>>>
>>>> One point that we didn't really have the time to discuss was Benny's
>>>> comment of also storing the version-id of views in the case that the
>>>> materialized view is referencing a view. I think we should also integrate
>>>> that into the spec.
>>>>
>>>> You can find the recording of the meeting here:
>>>>
>>>>
>>>> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
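
To make the lineage vs. lineage-state split discussed in this thread a bit more concrete, here is a rough Python sketch of how the two pieces might fit together for a freshness check. It is only an illustration under assumptions, not the agreed design: the thread fixes the "refresh-version-id" property and the idea of a JSON-encoded string in the snapshot summary, but the "lineage-state" key, the LineageEntry fields, and the current_state lookup (a stand-in for asking the catalog for the latest snapshot id or view version) are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageEntry:
    """One table or view in the MV's query tree (hypothetical shape)."""
    sequence_number: int        # unique within a single view version
    identifier: str             # fully qualified name, not a UUID
    kind: str                   # "table" or "view"
    ref: Optional[str] = None   # optional ref name/timestamp, tables only


def encode_lineage_state(state: Dict[int, int], refresh_version_id: int) -> Dict[str, str]:
    """Build snapshot-summary properties for the storage table.

    `state` maps a lineage sequence number to the table snapshot id or
    nested view version id used at refresh time. Summary values are
    strings, so the mapping is stored as a JSON-encoded string.
    The "lineage-state" key is an assumed name.
    """
    encoded = json.dumps({str(seq): value for seq, value in sorted(state.items())})
    return {
        "lineage-state": encoded,
        "refresh-version-id": str(refresh_version_id),
    }


def is_fresh(
    lineage: List[LineageEntry],
    summary: Dict[str, str],
    current_version_id: int,
    current_state: Dict[str, int],
) -> bool:
    """Return True if the materialization still matches the latest sources."""
    # The MV definition itself changed since the last refresh.
    if int(summary["refresh-version-id"]) != current_version_id:
        return False
    recorded = {int(seq): value for seq, value in json.loads(summary["lineage-state"]).items()}
    for entry in lineage:
        # A source missing from either side cannot be validated.
        if entry.sequence_number not in recorded or entry.identifier not in current_state:
            return False
        if recorded[entry.sequence_number] != current_state[entry.identifier]:
            return False
    return True


# Example: one source table and one nested view.
lineage = [
    LineageEntry(1, "db.orders", "table"),
    LineageEntry(2, "db.recent_orders", "view"),
]
summary = encode_lineage_state({1: 100, 2: 3}, refresh_version_id=7)
fresh = is_fresh(lineage, summary, 7, {"db.orders": 100, "db.recent_orders": 3})
```

Because the view-side lineage uses fully qualified identifiers, a dropped and re-created source table would simply show up with a different snapshot id in current_state and make the check fail, which seems to match the behavior Benny describes.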
