Thanks for updating the diagram and +1 to Fokko's suggestion.

On Fri, Nov 3, 2023 at 3:43 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Jason, thanks for updating the chart.
>
> I like it a lot. However, there are a lot of boxes and new terms. What do
> you think of keeping both files, and indicating that the old one applies to
> V1 tables and the new one to V2 tables?
>
> Kind regards,
> Fokko
>
> On Fri, Nov 3, 2023 at 2:37 PM Aaron Niskode-Dossett
> <aniskodedoss...@etsy.com.invalid> wrote:
>
>> An update would be greatly appreciated, thank you!
>>
>> On Thu, Nov 2, 2023 at 12:42 PM Jason Hughes <ja...@dremio.com.invalid>
>> wrote:
>>
>>> Hey all,
>>>
>>> The current architecture diagram
>>> <https://iceberg.apache.org/img/iceberg-metadata.png> for an Iceberg
>>> table hasn't been updated in over 3 years, and some aspects of the
>>> architecture of an Iceberg table have changed, most notably the addition
>>> of delete files and puffin files. Since this diagram gets a lot of use in
>>> enablement content around the community and isn't totally accurate
>>> anymore, @Ajantha Bhat U <ajantha.bh...@dremio.com> and I discussed
>>> updating it to be more accurate.
>>>
>>> here's an updated version of the diagram
>>> <https://docs.google.com/drawings/d/1m_iiJIJjiymadFIsCYnuUS6BvFo0MYDPCx0kKhZgIx4/edit>
>>> we put together
>>>
>>> a few points for discussion that we're interested in others' thoughts on:
>>>
>>>    1. the diagram is obviously somewhat more visually complicated than
>>>    the current one, but IMO the benefit of being more accurate for people
>>>    learning iceberg outweighs the additional complexity
>>>    2. since the partition stats spec PR
>>>    <https://github.com/apache/iceberg/pull/7105> just got merged, we
>>>    thought it'd be good to include that too while we're updating it, and
>>>    combine puffin files with partition stats files into one category of
>>>    files in the diagram labeled "statistics files". we combined them in
>>>    the diagram, rather than splitting them up, because (1) it provides a
>>>    simpler diagram, (2) it gets the primary point across, and (3) they
>>>    both serve the purpose of providing statistics for tools to leverage
>>>    (albeit for different use cases)
>>>    3. we put statistics files in place in the diagram for both s0 and
>>>    s1, though we could instead show statistics files only for s1, which
>>>    would (1) make the diagram simpler, and (2) show a simple example of
>>>    the use case of not needing stats files initially, but then needing
>>>    them as data grows and/or query patterns change
>>>
>>> if folks are on board with updating the diagram, and after we come to a
>>> conclusion on the above discussion points and any others that come up, I
>>> can export it to a png and create a PR to update the arch diagram image on
>>> the site
>>>
>>> thanks!
>>>
>>>
>>> Jason Hughes
>>> Dremio | Director of Technical Advocacy
>>>
>>
>> --
>> Aaron Niskode-Dossett, Data Engineering -- Etsy
>>
>
