Thanks for updating the diagram and +1 to Fokko's suggestion.

On Fri, Nov 3, 2023 at 3:43 PM Fokko Driesprong <fo...@apache.org> wrote:
> Hey Jason, thanks for updating the chart.
>
> I like it a lot. However, there are a lot of boxes and new terms. What do
> you think of keeping both files, and indicating that the old one applies to
> V1 tables, and the new one to V2 tables?
>
> Kind regards,
> Fokko
>
> On Fri, Nov 3, 2023 at 14:37, Aaron Niskode-Dossett
> <aniskodedoss...@etsy.com.invalid> wrote:
>
>> An update would be greatly appreciated, thank you!
>>
>> On Thu, Nov 2, 2023 at 12:42 PM Jason Hughes <ja...@dremio.com.invalid>
>> wrote:
>>
>>> Hey all,
>>>
>>> The current architecture diagram
>>> <https://iceberg.apache.org/img/iceberg-metadata.png> for an iceberg
>>> table hasn't been updated in over 3 years, and there are some aspects of
>>> the architecture of an iceberg table that have changed, most notably
>>> delete files and puffin files. since this diagram gets a lot of use in
>>> enablement content around the community and isn't totally accurate
>>> anymore, @Ajantha Bhat U <ajantha.bh...@dremio.com> and I discussed
>>> updating it to be more accurate
>>>
>>> here's an updated version of the diagram
>>> <https://docs.google.com/drawings/d/1m_iiJIJjiymadFIsCYnuUS6BvFo0MYDPCx0kKhZgIx4/edit>
>>> we put together
>>>
>>> a few points for discussion that we're interested in others' thoughts on:
>>>
>>>    1. the diagram is obviously somewhat more visually complicated than
>>>    the current one, but IMO the benefit of being more accurate for people
>>>    learning iceberg outweighs the additional complexity
>>>    2. since the partition stats spec PR
>>>    <https://github.com/apache/iceberg/pull/7105> just got merged, we
>>>    thought it'd be good to include that too while we're updating it, and
>>>    combine puffin files with partition stats files into one category of
>>>    files in the diagram labeled "statistics files". we combined them in
>>>    the diagram, rather than splitting them up, because (1) it provides a
>>>    simpler diagram, (2) it gets the primary point across, and (3) they
>>>    both serve the purpose of providing statistics for tools to leverage
>>>    (albeit for different use cases)
>>>    3. we put statistics files in place in the diagram for both s0 and
>>>    s1, though we could have statistics files only for s1, which would
>>>    (1) make the diagram simpler, and (2) show a simple example of the use
>>>    case of not needing stats files initially, but then needing them as
>>>    data grows and/or query patterns change
>>>
>>> if folks are on board with updating the diagram, and after we come to a
>>> conclusion on the above discussion points and any others that come up, I
>>> can export it to a png and create a PR to update the arch diagram image
>>> on the site
>>>
>>> thanks!
>>>
>>> Jason Hughes
>>> Dremio | Director of Technical Advocacy
>>>
>>
>> --
>> Aaron Niskode-Dossett, Data Engineering -- Etsy
>>
>