Looks good, Jan. I'm a bit nitpicky about picking good names, so I left some
comments around that to see what others think.

Thanks

On Fri, Jun 7, 2024 at 2:26 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:

> Thanks Benny and Walaa for your input. I updated the doc
> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
> to account for the changes as far as I understood. I would appreciate it if
> you had a look and gave me some feedback.
>
> If you have some open comments that are not relevant anymore due to the
> changes, please close them so that we can clean up the comments section a
> bit.
>
> Regards,
>
> Jan
> On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
>
> * lineage state JSON structure
>
> On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Benny,
>>
>> Your understanding is correct.
>>
>> Another point that we discussed was the type of APIs engines can use to
>> conveniently update the storage table with view query results as well as
>> set the snapshot summary on the output snapshot (one that was produced by
>> the update). We will follow up on that separately.
>>
>> Jan, do you want to reflect the lineage + state discussion in the doc
>> so we can iterate on the lineage JSON structure?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <btc...@gmail.com> wrote:
>>
>>> I really enjoyed listening to the replay and hearing everyone's
>>> feedback!  I'm in agreement with all 3 consensus items, especially around
>>> Dan's idea to separate the view's query tree lineage vs
>>> materialization's lineage state.
>>>
>>> I'll summarize my understanding about the distinction and add a few
>>> comments:
>>>
>>> Materialized View's Query Tree Lineage
>>> - It's basically the SQL representation converted to a distinct list of
>>> tables and views.
>>> - Stored inside view versions so if you change the view SQL, you can
>>> include the lineage with that change.
>>> - Tables support time travel, so they can optionally include a ref type
>>> and name/timestamp.
>>> - Views would NOT include the version (that's part of the
>>> materialization lineage state below)
>>> - I think we should use fully qualified identifiers here instead of
>>> UUIDs.  Dropping and re-creating a referenced table or view doesn't break
>>> the view SQL so the lineage should not be broken either.  I also don't
>>> think we can support time travel if we used table UUIDs here.
>>> - Each table or view can be assigned a unique sequence number.  This
>>> sequence number is scoped to a single view version.
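
A minimal sketch of the query tree lineage as Benny describes it, written in
Python only for illustration. The field names ("identifier", "type", "ref",
"sequence-number") and the helper build_lineage are hypothetical and not
taken from the doc or the spec; the sketch only shows a distinct list of
fully qualified identifiers with sequence numbers scoped to one view version.

```python
import json


def build_lineage(references: list[dict]) -> list[dict]:
    """Deduplicate referenced tables/views and assign sequence numbers.

    Each input dict carries a fully qualified "identifier", a "type"
    ("table" or "view") and, for tables only, an optional "ref" used for
    time travel. All of these field names are assumptions.
    """
    lineage: list[dict] = []
    seen: set[tuple] = set()
    for ref in references:
        # The same table at a different ref counts as a separate entry.
        key = (ref["identifier"], ref["type"],
               json.dumps(ref.get("ref"), sort_keys=True))
        if key in seen:
            continue  # keep the list distinct
        seen.add(key)
        entry = {
            # Sequence numbers are scoped to this single view version.
            "sequence-number": len(lineage),
            "identifier": ref["identifier"],
            "type": ref["type"],
        }
        # Views never carry a version here; only tables may carry a ref.
        if ref["type"] == "table" and ref.get("ref") is not None:
            entry["ref"] = ref["ref"]
        lineage.append(entry)
    return lineage
```

Because entries are keyed by identifier rather than UUID, dropping and
re-creating a referenced table would leave this list unchanged.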
>>>
>>> Materialization Lineage State
>>> - It's basically a lookup table for the above sequence number to either
>>> a table snapshot id or view version that was used at the time of
>>> creating/refreshing the storage table.  For views, these are nested views
>>> within the MV's query tree - not the MV itself.
>>> - Stored inside the table's snapshot summary
>>> - Additional property "refresh-version-id" to identify the MV's version.
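
To make the lookup-table idea concrete, here is a rough Python sketch. Only
"refresh-version-id" comes from the thread; the keys "lineage-state",
"snapshot-id" and "version-id", the lineage entry shape, and the helper
build_lineage_state are assumptions made for illustration.

```python
def build_lineage_state(refresh_version_id: int,
                        lineage: list[dict],
                        resolved: dict[str, int]) -> dict:
    """Map each lineage sequence number to what was read at refresh time.

    `lineage` is a list of entries with "sequence-number", "identifier"
    and "type" (hypothetical shape). `resolved` maps each identifier to
    the table snapshot id or nested view version id used by the refresh.
    """
    state: dict[str, dict] = {}
    for entry in lineage:
        used = resolved[entry["identifier"]]
        # Tables record a snapshot id, nested views record a version id.
        key = "snapshot-id" if entry["type"] == "table" else "version-id"
        # String keys, since the state is meant to end up as JSON.
        state[str(entry["sequence-number"])] = {key: used}
    return {
        "refresh-version-id": refresh_version_id,  # the MV's own version
        "lineage-state": state,
    }
```

The MV's own version is kept outside the lookup table, matching the point
that the nested views in the table are not the MV itself.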
>>>
>>> In order to validate the freshness of a materialization, everything
>>> above has to be checked against the latest tables and views.  This should
>>> cover all data and query tree changes (that I can think of) such as the
>>> "limit 100" example I gave in Slack
>>> <https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>
>>> .
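
The freshness check described above could look roughly like this in Python.
It assumes the recorded state has the hypothetical shape
{"refresh-version-id": ..., "lineage-state": {seq: {...}}} and that the
caller has already resolved the current snapshot ids and view versions; the
function name and all keys other than "refresh-version-id" are illustrative.

```python
def is_fresh(state: dict,
             current_view_version: int,
             current_sources: dict[str, dict]) -> bool:
    """Return True only if nothing changed since the last refresh.

    `current_sources` maps each sequence number (as a string) to the
    latest {"snapshot-id": ...} or {"version-id": ...} of that source.
    """
    # The MV's own definition changed, so the query tree may differ.
    if state["refresh-version-id"] != current_view_version:
        return False
    recorded = state["lineage-state"]
    # A source was added to or removed from the query tree.
    if set(recorded) != set(current_sources):
        return False
    # Every table snapshot and nested view version must still match.
    return all(recorded[seq] == current_sources[seq] for seq in recorded)
```

This treats any difference as stale; an engine could apply a more lenient
policy, which the thread does not specify.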
>>>
>>> Please let me know your thoughts.
>>>
>>> Thanks
>>>
>>> On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:
>>>
>>>> Thanks for hosting; it was a very helpful meeting. I really hope we can
>>>> do more in the future to accelerate consensus on other proposals.
>>>>
>>>>
>>>>  I do encourage anyone on the mailing list to add your comments offline
>>>> as well, especially if you have strong feelings. Iceberg is an open project
>>>> and we realize not everyone can attend virtual meetings and want you to
>>>> know you are welcome.
>>>>
>>>>
>>>>
>>>> On Jun 6, 2024, at 7:11 AM, Jan Kaul <jank...@mailbox.org.invalid>
>>>> wrote:
>>>>
>>>> 
>>>>
>>>> Hi all,
>>>>
>>>> thanks to all of you who attended the meeting yesterday! It was great
>>>> to talk to you and I think we made great progress. For those of you who
>>>> weren't able to attend the meeting, I summarized the main points below:
>>>>
>>>> *Question 1*: Should we store the "storage table pointer" as a view
>>>> property or as an additional field in the view metadata?
>>>>
>>>> We reached consensus to add a *new metadata field* "storage-table" to
>>>> the view version <https://iceberg.apache.org/view-spec/#versions>
>>>> record that stores the identifier of the storage table. The motivation
>>>> for introducing a new field is that this emphasizes that materialized views
>>>> are part of the standard and it enforces a common behavior.
>>>>
>>>> *Question 2*: Where should the lineage-state information be stored?
>>>>
>>>> We reached consensus on storing the lineage-state information in the
>>>> *snapshot summary* of the storage table. The motivation behind this is that the
>>>> table spec should not be concerned with defining view constructs.
>>>>
>>>> *Question 3*: How should the lineage-state information be represented?
>>>>
>>>> We reached consensus on representing the lineage-state in the form of
>>>> nested objects and storing these as a *JSON-encoded string* inside the
>>>> storage table snapshot summary.
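
Since the snapshot summary is a map of strings to strings, the nested
lineage-state has to be serialized before it is stored. A small Python
sketch of that round trip follows; the property name "lineage-state" and
both helper functions are assumptions, as the thread does not fix a name.

```python
import json
from typing import Optional

SUMMARY_KEY = "lineage-state"  # hypothetical summary property name


def write_lineage_state(summary: dict[str, str], state: dict) -> dict[str, str]:
    """Return a copy of the summary with the state stored as a JSON string."""
    updated = dict(summary)
    # Compact, key-sorted output keeps the stored string deterministic.
    updated[SUMMARY_KEY] = json.dumps(state, sort_keys=True,
                                      separators=(",", ":"))
    return updated


def read_lineage_state(summary: dict[str, str]) -> Optional[dict]:
    """Decode the nested objects again, or return None if absent."""
    raw = summary.get(SUMMARY_KEY)
    return None if raw is None else json.loads(raw)
```

Note that JSON object keys are always strings, so sequence numbers used as
keys come back as strings after decoding.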
>>>>
>>>> Additionally, Dan proposed to introduce a new lineage construct as part
>>>> of the view definition in addition to the lineage-state that is part of the
>>>> storage table. The idea is to separate the concerns. The lineage-state in
>>>> the storage table should only capture the state of the source tables at the
>>>> time of the last refresh, whereas the lineage information in the view
>>>> contains more information about the source tables and is responsible for
>>>> resolving the identifiers. We haven't really decided on how the new lineage
>>>> construct should be represented or integrated into the view metadata.
>>>>
>>>> One point that we didn't really have the time to discuss was Benny's
>>>> comment of also storing the version-id of views in the case that the
>>>> materialized view is referencing a view. I think we should also integrate
>>>> that into the spec.
>>>>
>>>> You can find the recording of the meeting here:
>>>>
>>>>
>>>> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>>
>>>>
