Thanks Amogh. Looks like the recording for last week's sync is available on YouTube. Here's the link: https://www.youtube.com/watch?v=uWm-p--8oVQ
Best,
Kevin Liu

On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]> wrote:

> Hey folks,
>
> Just following up on this to give the community an update on where we're at and
> my proposed next steps.
>
> I've been editing and merging the contents from our proposal into the
> proposal
> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw>
> from Russell and others. For any future comments on docs, please comment on the
> linked proposal. I've also marked it on our doc in red text so it's clear
> to redirect to the other proposal as a source of truth for comments.
>
> In terms of next steps,
>
> 1. An important design decision point is around inline manifest DVs,
> external manifest DVs or enabling both. I'm working on measuring different
> approaches for representing the compressed DV representation since that
> will inform how many entries can reasonably fit in a small root manifest;
> from that we can derive implications on different write patterns and
> determine the right approach for storing these manifest DVs.
>
> 2. Another key point is around determining if/how we can reasonably enable
> V4 to represent changes in the root manifest so that readers can
> effectively just infer file level changes from the root.
>
> 3. One of the aspects of the proposal is getting away from the partition tuple
> requirement in the root, which currently requires an association
> between a partition spec and a manifest. These aspects can be modeled as
> essentially column stats, which gives a lot of flexibility in the
> organization of the manifest. There are important details around field ID
> spaces here which tie into how the stats are structured. What we're
> proposing here is to have a unified expression ID space that could also
> benefit us for storing things like virtual columns down the line.
> I go into
> this in the proposal but I'm working on separating the appropriate parts so
> that the original proposal can mostly just focus on the organization of the
> content metadata tree and not how we want to solve this particular ID space
> problem.
>
> 4. I'm planning on scheduling a recurring community sync starting next
> Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this
> time will never work, I can certainly adjust. For some reason, I don't have
> the ability to add to the Iceberg Dev calendar, so I'll figure that out and
> update the thread when the event is scheduled.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:
>
>> I think this is a great way forward, starting out with this much parallel
>> development shows that we have a lot of consensus already :)
>>
>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]>
>> wrote:
>>
>>> Hey folks, just following up on this. It looks like our proposal and the
>>> proposal that @Russell Spitzer <[email protected]> shared are
>>> pretty aligned. I was just chatting with Russell about this, and we think
>>> it'd be best to combine both proposals and have a singular large effort on
>>> this. I can also set up a focused community discussion (similar to what
>>> we're doing on the other V4 proposals) on this starting sometime next week
>>> just to get things moving, if that works for people.
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> Hey Russell,
>>>>
>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop and I)
>>>> have also been working on a proposal for an adaptive metadata tree
>>>> structure as part of enabling more efficient one file commits. From a read
>>>> of the summary, it's great to see that we're thinking along the same lines
>>>> about how to tackle this fundamental area!
>>>>
>>>> Here is our proposal:
>>>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>
>>>> Thanks,
>>>> Amogh Jahagirdar
>>>>
>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:
>>>>
>>>>> Hey y'all!
>>>>>
>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some of the thoughts
>>>>> we had on how one-file commits could work in Iceberg. This is pretty
>>>>> much just a high level overview of the concepts we think we need and
>>>>> how Iceberg would behave. We haven't gone very far into the actual
>>>>> implementation and changes that would need to occur in the SDK to make
>>>>> this happen.
>>>>>
>>>>> The high level summary is:
>>>>>
>>>>> - Manifest Lists are out
>>>>> - Root Manifests take their place
>>>>> - A Root manifest can have data manifests, delete manifests, manifest
>>>>>   delete vectors, data delete vectors and data files
>>>>> - Manifest delete vectors allow for modifying a manifest without
>>>>>   deleting it entirely
>>>>> - Data files let you append without writing an intermediary manifest
>>>>> - Having child data and delete manifests lets you still scale
>>>>>
>>>>> Please take a look if you like,
>>>>>
>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>
>>>>> I'm excited to see what other proposals and ideas are floating around
>>>>> the community,
>>>>> Russ
>>>>>
>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]> wrote:
>>>>>
>>>>>> Very excited about the idea!
>>>>>>
>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <[email protected]> wrote:
>>>>>>
>>>>>>> I'm very interested in this initiative.
>>>>>>> Micah Kornfield and I
>>>>>>> presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405>
>>>>>>> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg
>>>>>>> Summit, which leveraged Google infrastructure like Colossus for
>>>>>>> efficient appends.
>>>>>>>
>>>>>>> This new proposal is particularly exciting because it offers
>>>>>>> significant advancements in commit latency and metadata storage
>>>>>>> footprint. Furthermore, a consistent manifest structure promises to
>>>>>>> simplify the design and codebase, which is a major benefit.
>>>>>>>
>>>>>>> A related idea I've been exploring is having a loose affinity
>>>>>>> between data and delete manifests. While the current separation of
>>>>>>> data and delete manifests in Iceberg is valuable for avoiding data
>>>>>>> file rewrites (and stats updates) when deletes change, it does
>>>>>>> necessitate a join operation during reads. I'd be keen to discuss
>>>>>>> approaches that could potentially reduce this read-side cost while
>>>>>>> retaining the benefits of separate manifests.
>>>>>>>
>>>>>>> Best,
>>>>>>> Anoop
>>>>>>>
>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I am new to the Iceberg community but would love to participate in
>>>>>>>> these discussions to reduce the number of file writes, especially
>>>>>>>> for small writes/commits.
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> -Jagdeep
>>>>>>>>
>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We have been hitting all the metadata problems you mentioned,
>>>>>>>>> Ryan. I’m on-board to help however I can to improve this area.
>>>>>>>>>
>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>
>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I am interested in this idea and looking forward to collaboration.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Huang-Hsiang
>>>>>>>>>
>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Namratha
>>>>>>>>>
>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for kicking this thread off Ryan, I'm interested in
>>>>>>>>>> helping out here! I've been working on a proposal in this area and
>>>>>>>>>> it would be great to collaborate with different folks and exchange
>>>>>>>>>> ideas here, since I think a lot of people are interested in solving
>>>>>>>>>> this problem.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>
>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> Like Russell’s recent note, I’m starting a thread to connect
>>>>>>>>>>> those of us that are interested in the idea of changing Iceberg’s
>>>>>>>>>>> metadata in v4 so that in most cases committing a change only
>>>>>>>>>>> requires writing one additional metadata file.
>>>>>>>>>>>
>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>
>>>>>>>>>>> The current Iceberg metadata structure requires writing at least
>>>>>>>>>>> one manifest and a new manifest list to produce a new snapshot. The
>>>>>>>>>>> goal of this work is to allow more flexibility by allowing the
>>>>>>>>>>> manifest list layer to store data and delete files. As a result,
>>>>>>>>>>> only one file write would be needed before committing the new
>>>>>>>>>>> snapshot.
>>>>>>>>>>> In addition, this work will also try to explore:
>>>>>>>>>>>
>>>>>>>>>>> - Avoiding small manifests that must be read in parallel and
>>>>>>>>>>>   later compacted (metadata maintenance changes)
>>>>>>>>>>> - Extending metadata skipping to use aggregated column ranges
>>>>>>>>>>>   that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>> - Using soft deletes to avoid rewriting existing manifests
>>>>>>>>>>>   (metadata DVs)
>>>>>>>>>>>
>>>>>>>>>>> If you’re interested in these problems, please reply!
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
