Re: [DISCUSS] v4 - One file commits

Amogh Jahagirdar Sat, 18 Oct 2025 12:50:01 -0700

Hey folks,

Sorry for the delay, here's the recording link
<https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view>  from
last week's discussion.


Thanks,
Amogh Jahagirdar

On Fri, Oct 10, 2025 at 9:44 AM Péter Váry <[email protected]>
wrote:

> Same here.
> Please record if you can.
> Thanks, Peter
>
> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong <[email protected]> wrote:
>
>> Hey Amogh,
>>
>> Thanks for the write-up. Unfortunately, I won’t be able to attend. Will
>> it be recorded? Thanks!
>>
>> Kind regards,
>> Fokko
>>
>> Op di 7 okt 2025 om 20:36 schreef Amogh Jahagirdar <[email protected]>
>>
>>> Hey all,
>>>
>>> I've set up time this Friday at 9am PST for another sync on single file
>>> commits. In terms of what would be great to focus on for the discussion:
>>>
>>> 1. Whether it makes sense or not to eliminate the tuple, and instead
>>> represent the tuple via lower/upper bounds. As a reminder, one of
>>> the goals is to avoid tying a partition spec to a manifest; in the root we
>>> can have a mix of files spanning different partition specs, and even in
>>> leaf manifests avoiding this coupling can enable more desirable clustering
>>> of metadata.
>>> In the vast majority of cases, we could leverage the property that a
>>> file is effectively partitioned if the lower/upper for a given field is
>>> equal. The nuance here is with the particular case of identity partitioned
>>> string/binary columns which can be truncated in stats. One approach is to
>>> require that writers must not produce truncated stats for identity
>>> partitioned columns. It's also important to keep in mind that all of this
>>> is just for the purpose of reconstructing the partition tuple, which is
>>> only required during equality delete matching. Another area we need to
>>> cover as part of this is exact bounds on stats. There are other options
>>> here as well such as making all new equality deletes in V4 be global and
>>> instead match based on bounds, or keeping the tuple but each tuple is
>>> effectively based off a union schema of all partition specs. I am adding a
>>> separate appendix section outlining the span of options here and the
>>> different tradeoffs.
>>> Once we get this more to a conclusive state, I'll move a summarized
>>> version to the main doc.
>>>
>>> 2. @[email protected] <[email protected]> has updated the doc
>>> with a section
>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn>
>>>  on
>>> how we can do change detection from the root in a variety of write
>>> scenarios. I've done a review on it, and it covers the cases I would
>>> expect. It'd be good for folks to take a look and please give feedback
>>> before we discuss. Thank you Steven for adding that section and all the
>>> diagrams.
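
To make item 1 above concrete, here is a minimal sketch of reconstructing a partition tuple from lower/upper bounds. The stats layout (a dict from field id to a (lower, upper, is_exact) triple), the is_exact flag, and the function name are all assumptions made for illustration; none of them come from the proposal doc or from the Iceberg API.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stats layout: field id -> (lower bound, upper bound, is_exact).
# is_exact is False when the writer truncated the bound (string/binary columns).
Bounds = Dict[int, Tuple[object, object, bool]]


def reconstruct_partition_tuple(
    bounds: Bounds, identity_field_ids: List[int]
) -> Optional[tuple]:
    """Return the partition tuple implied by the bounds, or None if it cannot be inferred."""
    values = []
    for field_id in identity_field_ids:
        if field_id not in bounds:
            return None  # no stats for this field, nothing to infer
        lower, upper, is_exact = bounds[field_id]
        # A truncated bound does not carry the full value, so the tuple
        # cannot be recovered from it.
        if not is_exact:
            return None
        # The file is only "effectively partitioned" on this field when
        # every row shares one value, i.e. lower == upper.
        if lower != upper:
            return None
        values.append(lower)
    return tuple(values)


stats = {1: ("2025-10-07", "2025-10-07", True), 2: ("a", "z", True)}
partitioned = reconstruct_partition_tuple(stats, [1])
not_partitioned = reconstruct_partition_tuple(stats, [1, 2])
```

If writers were required not to truncate stats for identity partitioned columns, is_exact would always be true for those fields and only the lower == upper check would remain.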
>>>
>>> Thanks,
>>> Amogh Jahagirdar
>>>
>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> Hey folks just following up from the discussion last Friday with a
>>>> summary and some next steps:
>>>>
>>>> 1.) For the various change detection cases, we concluded it's best just
>>>> to go through those in an offline manner on the doc since it's hard to
>>>> verify all that correctness in a large meeting setting.
>>>> 2.) We mostly discussed eliminating the partition tuple. In the
>>>> original proposal, I was mostly aiming for the ability to reconstruct
>>>> the tuple from the stats for the purpose of equality delete matching (a
>>>> file is partitioned if the lower and upper bounds are equal). There's some
>>>> nuance in how we need to handle identity partition values since for
>>>> string/binary they cannot be truncated. Another potential option is to
>>>> treat all equality deletes as effectively global and narrow their
>>>> application based on the stats values. This may require defining tight
>>>> bounds. I'm still collecting my thoughts on this one.
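
As a rough sketch of the "treat equality deletes as global and narrow them by stats" option in 2.) above: a delete can be skipped for a data file whenever one of its equality values falls outside that file's bounds. The dict shapes and the function name are hypothetical, and the sketch deliberately ignores sequence numbers and everything else a real delete-matching path would have to check.

```python
from typing import Dict, Tuple


def delete_may_apply(
    delete_values: Dict[int, object],
    file_bounds: Dict[int, Tuple[object, object]],
) -> bool:
    """Return False only when the bounds prove the delete cannot match any row in the file.

    delete_values: field id -> value the equality delete matches on (assumed shape).
    file_bounds:   field id -> (lower, upper) bounds for the data file (assumed shape).
    """
    for field_id, value in delete_values.items():
        if field_id not in file_bounds:
            continue  # no stats for this field: the file cannot be ruled out
        lower, upper = file_bounds[field_id]
        if value < lower or value > upper:
            return False  # value lies outside the file's range for this field
    return True


bounds = {1: (1, 10)}
inside = delete_may_apply({1: 5}, bounds)
outside = delete_may_apply({1: 11}, bounds)
```

Loose bounds keep this check safe but less selective, which is presumably where the question of defining tight bounds comes in.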
>>>>
>>>> Thanks folks! Please also let me know if any of the following links are
>>>> inaccessible for any reason.
>>>>
>>>> Meeting recording link:
>>>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view
>>>> Meeting summary:
>>>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU
>>>>
>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]>
>>>> wrote:
>>>>
>>>>> Update: I moved the discussion time to this Friday at 9 am PST since I
>>>>> found out that quite a few folks involved in the proposals will be out
>>>>> next week, and I also know some folks will be out the week after that.
>>>>>
>>>>> Thanks,
>>>>> Amogh J
>>>>>
>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey folks sorry for the late follow up here,
>>>>>>
>>>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording
>>>>>> link of the previous discussion! I've set up another sync for next
>>>>>> Tuesday 09/16 at 9am PST. This time I've set it up from my corporate
>>>>>> email so we can get recordings and transcriptions (and I've made sure
>>>>>> to keep the meeting invite open so we don't have to manually let
>>>>>> people in).
>>>>>>
>>>>>> In terms of next steps of areas which I think would be good to focus
>>>>>> on for establishing consensus:
>>>>>>
>>>>>> 1. How do we model the manifest entry structure so that changes to
>>>>>> manifest DVs can be obtained easily from the root? There are a few
>>>>>> options here; the most promising approach is to keep an additional DV
>>>>>> which encodes the diff in additional positions which have been removed
>>>>>> from a leaf manifest.
>>>>>>
>>>>>> 2. Modeling partition transforms via expressions and establishing a
>>>>>> unified table ID space so that we can simplify how partition tuples may
>>>>>> be represented via stats and also have a way in the future to store
>>>>>> stats on any derived column. I have a short proposal
>>>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0>
>>>>>>  for
>>>>>> this that probably still needs some tightening up on the expression
>>>>>> modeling itself (and some prototyping) but the general idea for
>>>>>> establishing a unified table ID space is covered. All feedback welcome!
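
For item 1 above, the "additional DV which encodes the diff" idea can be sketched with plain sets standing in for the compressed bitmap. The variable and function names are hypothetical and the real encoding is still an open question in the thread; this only shows the set arithmetic.

```python
from typing import FrozenSet, Iterable


def newly_removed_positions(
    previous_dv: Iterable[int], current_dv: Iterable[int]
) -> FrozenSet[int]:
    """Diff DV: entry positions removed by this commit that were not already removed."""
    return frozenset(current_dv) - frozenset(previous_dv)


previous_dv = {1, 3}        # positions already removed in the parent snapshot
current_dv = {1, 3, 7, 9}   # positions removed after this commit
diff_dv = newly_removed_positions(previous_dv, current_dv)
```

A reader doing change detection from the root would then only need diff_dv to find the leaf manifest entries removed by the commit, without comparing two full DVs.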
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
>>>>>>
>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Amogh. Looks like the recording for last week's sync is
>>>>>>> available on Youtube. Here's the link,
>>>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ
>>>>>>>
>>>>>>> Best,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> Just following up on this to give the community an update as to
>>>>>>>> where we're at and my proposed next steps.
>>>>>>>>
>>>>>>>> I've been editing and merging the contents from our proposal into
>>>>>>>> the proposal
>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw>
>>>>>>>>  from
>>>>>>>> Russell and others. For any future comments on docs, please comment on
>>>>>>>> the linked proposal. I've also marked it on our doc in red text so it's
>>>>>>>> clear to redirect to the other proposal as a source of truth for comments.
>>>>>>>>
>>>>>>>> In terms of next steps,
>>>>>>>>
>>>>>>>> 1. An important design decision point is around inline manifest
>>>>>>>> DVs, external manifest DVs or enabling both. I'm working on
>>>>>>>> measuring different approaches for the compressed DV representation
>>>>>>>> since that will inform how many entries can reasonably fit in a small
>>>>>>>> root manifest; from that we can derive implications on different
>>>>>>>> write patterns and determine the right approach for storing these
>>>>>>>> manifest DVs.
>>>>>>>>
>>>>>>>> 2. Another key point is around determining if/how we can reasonably
>>>>>>>> enable V4 to represent changes in the root manifest so that readers can
>>>>>>>> effectively just infer file level changes from the root.
>>>>>>>>
>>>>>>>> 3. One of the aspects of the proposal is getting away from the
>>>>>>>> partition tuple requirement in the root, which currently forces an
>>>>>>>> association between a partition spec and a manifest. These aspects can
>>>>>>>> be modeled as essentially column stats, which gives a lot of
>>>>>>>> flexibility in the organization of the manifest. There are important
>>>>>>>> details around field ID spaces here which tie into how the stats are
>>>>>>>> structured. What we're proposing here is to have a unified expression
>>>>>>>> ID space that could also benefit us for storing things like virtual
>>>>>>>> columns down the line. I go into this in the proposal but I'm working
>>>>>>>> on separating the appropriate parts so that the original proposal can
>>>>>>>> mostly just focus on the organization of the content metadata tree and
>>>>>>>> not how we want to solve this particular ID space problem.
>>>>>>>>
>>>>>>>> 4. I'm planning on scheduling a recurring community sync starting
>>>>>>>> next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks
>>>>>>>> that this time will never work, I can certainly adjust. For some
>>>>>>>> reason, I don't have the ability to add to the Iceberg Dev calendar,
>>>>>>>> so I'll figure that out and update the thread when the event is
>>>>>>>> scheduled.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think this is a great way forward, starting out with this much
>>>>>>>>> parallel development shows that we have a lot of consensus already :)
>>>>>>>>>
>>>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hey folks, just following up on this. It looks like our proposal
>>>>>>>>>> and the proposal that @Russell Spitzer
>>>>>>>>>> <[email protected]> shared are pretty aligned. I was
>>>>>>>>>> just chatting with Russell about this, and we think it'd be best to
>>>>>>>>>> combine both proposals and have a singular large effort on this. I can
>>>>>>>>>> also set up a focused community discussion (similar to what we're
>>>>>>>>>> doing on the other V4 proposals) on this starting sometime next week
>>>>>>>>>> just to get things moving, if that works for people.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Russell,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop
>>>>>>>>>>> and I) have also been working on a proposal for an adaptive metadata
>>>>>>>>>>> tree structure as part of enabling more efficient one file commits.
>>>>>>>>>>> From a read of the summary, it's great to see that we're thinking
>>>>>>>>>>> along the same lines about how to tackle this fundamental area!
>>>>>>>>>>>
>>>>>>>>>>> Here is our proposal:
>>>>>>>>>>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey y'all!
>>>>>>>>>>>>
>>>>>>>>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some of the
>>>>>>>>>>>> thoughts we had on how one-file commits could work in Iceberg. This
>>>>>>>>>>>> is pretty much just a high level overview of the concepts we think
>>>>>>>>>>>> we need and how Iceberg would behave. We haven't gone very far into
>>>>>>>>>>>> the actual implementation and changes that would need to occur in
>>>>>>>>>>>> the SDK to make this happen.
>>>>>>>>>>>>
>>>>>>>>>>>> The high level summary is:
>>>>>>>>>>>>
>>>>>>>>>>>> Manifest Lists are out
>>>>>>>>>>>> Root Manifests take their place
>>>>>>>>>>>>   A Root manifest can have data manifests, delete manifests,
>>>>>>>>>>>> manifest delete vectors, data delete vectors and data files
>>>>>>>>>>>>   Manifest delete vectors allow for modifying a manifest
>>>>>>>>>>>> without deleting it entirely
>>>>>>>>>>>>   Data files let you append without writing an intermediary
>>>>>>>>>>>> manifest
>>>>>>>>>>>>   Having child data and delete manifests lets you still scale
>>>>>>>>>>>>
>>>>>>>>>>>> Please take a look if you like,
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>>>>>>>>>>
>>>>>>>>>>>> I'm excited to see what other proposals and ideas are floating
>>>>>>>>>>>> around the community,
>>>>>>>>>>>> Russ
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Very excited about the idea!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm very interested in this initiative. Micah Kornfield and I
>>>>>>>>>>>>>> presented
>>>>>>>>>>>>>> <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on
>>>>>>>>>>>>>> high-throughput ingestion for Iceberg tables at the 2024 Iceberg
>>>>>>>>>>>>>> Summit, which leveraged Google infrastructure like Colossus for
>>>>>>>>>>>>>> efficient appends.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This new proposal is particularly exciting because it offers
>>>>>>>>>>>>>> significant advancements in commit latency and metadata storage
>>>>>>>>>>>>>> footprint. Furthermore, a consistent manifest structure promises to
>>>>>>>>>>>>>> simplify the design and codebase, which is a major benefit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A related idea I've been exploring is having a loose affinity
>>>>>>>>>>>>>> between data and delete manifests. While the current separation of
>>>>>>>>>>>>>> data and delete manifests in Iceberg is valuable for avoiding data
>>>>>>>>>>>>>> file rewrites (and stats updates) when deletes change, it does
>>>>>>>>>>>>>> necessitate a join operation during reads. I'd be keen to discuss
>>>>>>>>>>>>>> approaches that could potentially reduce this read-side cost while
>>>>>>>>>>>>>> retaining the benefits of separate manifests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Anoop
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am new to the Iceberg community but would love to
>>>>>>>>>>>>>>> participate in these discussions to reduce the number of file
>>>>>>>>>>>>>>> writes, especially for small writes/commits.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>>> -Jagdeep
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We have been hitting all the metadata problems you
>>>>>>>>>>>>>>>> mentioned, Ryan. I’m on-board to help however I can to improve
>>>>>>>>>>>>>>>> this area.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ~ Anurag Mantripragada
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am interested in this idea and looking forward to
>>>>>>>>>>>>>>>> collaboration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Huang-Hsiang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am interested in contributing to this effort.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Namratha
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for kicking this thread off Ryan, I'm interested in
>>>>>>>>>>>>>>>>> helping out here! I've been working on a proposal in this area and
>>>>>>>>>>>>>>>>> it would be great to collaborate with different folks and exchange
>>>>>>>>>>>>>>>>> ideas here, since I think a lot of people are interested in solving
>>>>>>>>>>>>>>>>> this problem.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Like Russell’s recent note, I’m starting a thread to
>>>>>>>>>>>>>>>>>> connect those of us that are interested in the idea of changing
>>>>>>>>>>>>>>>>>> Iceberg’s metadata in v4 so that in most cases committing a change
>>>>>>>>>>>>>>>>>> only requires writing one additional metadata file.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing
>>>>>>>>>>>>>>>>>> at least one manifest and a new manifest list to produce a new
>>>>>>>>>>>>>>>>>> snapshot. The goal of this work is to allow more flexibility by
>>>>>>>>>>>>>>>>>> allowing the manifest list layer to store data and delete files. As
>>>>>>>>>>>>>>>>>> a result, only one file write would be needed before committing the
>>>>>>>>>>>>>>>>>> new snapshot. In addition, this work will also try to explore:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - Avoiding small manifests that must be read in
>>>>>>>>>>>>>>>>>>    parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>>>    - Extending metadata skipping to use aggregated column
>>>>>>>>>>>>>>>>>>    ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>>>    - Using soft deletes to avoid rewriting existing
>>>>>>>>>>>>>>>>>>    manifests (metadata DVs)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you’re interested in these problems, please reply!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>>
>>>>>>>>>>>>
