I think this is a great way forward. Starting out with this much parallel
development shows that we have a lot of consensus already :)

On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey folks, just following up on this. It looks like our proposal and the
> proposal that @Russell Spitzer <russell.spit...@gmail.com> shared are
> pretty aligned. I was just chatting with Russell about this, and we think
> it'd be best to combine both proposals and have a singular large effort on
> this. I can also set up a focused community discussion (similar to what
> we're doing on the other V4 proposals) on this starting sometime next week
> just to get things moving, if that works for people.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Hey Russell,
>>
>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop and I)
>> have also been working on a proposal for an adaptive metadata tree
>> structure as part of enabling more efficient one file commits. From a read
>> of the summary, it's great to see that we're thinking along the same lines
>> about how to tackle this fundamental area!
>>
>> Here is our proposal:
>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Hey y'all!
>>>
>>> We (Yi Fang, Steven Wu and myself) wanted to share some of the thoughts
>>> we had on how one-file commits could work in Iceberg. This is pretty much
>>> just a high-level overview of the concepts we think we need and how
>>> Iceberg would behave. We haven't gone very far into the actual
>>> implementation and the changes that would need to occur in the SDK to
>>> make this happen.
>>>
>>> The high level summary is:
>>>
>>> Manifest Lists are out
>>> Root Manifests take their place
>>>    - A Root Manifest can have data manifests, delete manifests, manifest
>>>    delete vectors, data delete vectors and data files
>>>    - Manifest delete vectors allow for modifying a manifest without
>>>    deleting it entirely
>>>    - Data files let you append without writing an intermediary manifest
>>>    - Having child data and delete manifests lets you still scale
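The summary above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the layout either proposal defines: the names `EntryKind`, `Entry`, `RootManifest`, `live_data_files`, and the `.dv` path suffix are all hypothetical, and a manifest is modeled as a plain list of data file paths.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class EntryKind(Enum):
    # The five kinds of entries the summary says a root manifest can hold.
    DATA_MANIFEST = "data_manifest"
    DELETE_MANIFEST = "delete_manifest"
    MANIFEST_DV = "manifest_dv"
    DATA_DV = "data_dv"
    DATA_FILE = "data_file"


@dataclass(frozen=True)
class Entry:
    kind: EntryKind
    path: str
    # Used by MANIFEST_DV only (hypothetical fields): the manifest it masks
    # and the positions of the soft-deleted entries inside that manifest.
    target: Optional[str] = None
    deleted_positions: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class RootManifest:
    entries: tuple = ()

    def append_data_file(self, path):
        # Append without writing an intermediary manifest: the data file
        # is recorded directly in the new root.
        return RootManifest(self.entries + (Entry(EntryKind.DATA_FILE, path),))

    def soft_delete(self, manifest_path, positions):
        # Modify a child manifest without rewriting it: add a delete vector
        # that masks some of its entries. The ".dv" suffix is made up.
        dv = Entry(EntryKind.MANIFEST_DV, manifest_path + ".dv",
                   target=manifest_path,
                   deleted_positions=frozenset(positions))
        return RootManifest(self.entries + (dv,))


def live_data_files(root, manifest_contents):
    # Collect masked positions per child manifest.
    masked = {}
    for e in root.entries:
        if e.kind is EntryKind.MANIFEST_DV:
            masked.setdefault(e.target, set()).update(e.deleted_positions)
    # Walk the root: expand child data manifests, keep inline data files.
    live = []
    for e in root.entries:
        if e.kind is EntryKind.DATA_MANIFEST:
            skip = masked.get(e.path, set())
            live.extend(f for i, f in enumerate(manifest_contents[e.path])
                        if i not in skip)
        elif e.kind is EntryKind.DATA_FILE:
            live.append(e.path)
    return live
```

In this sketch each change produces one new root object, which stands in for the single metadata file written per commit; delete manifests and data delete vectors are carried as entry kinds but not resolved.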
>>>
>>> Please take a look if you like,
>>>
>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0
>>>
>>> I'm excited to see what other proposals and ideas are floating around
>>> the community,
>>> Russ
>>>
>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Very excited about the idea!
>>>>
>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <anoop.k.john...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm very interested in this initiative. Micah Kornfield and I
>>>>> presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405>
>>>>> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg
>>>>> Summit, which leveraged Google infrastructure like Colossus for
>>>>> efficient appends.
>>>>>
>>>>> This new proposal is particularly exciting because it offers
>>>>> significant advancements in commit latency and metadata storage footprint.
>>>>> Furthermore, a consistent manifest structure promises to simplify the
>>>>> design and codebase, which is a major benefit.
>>>>>
>>>>> A related idea I've been exploring is having a loose affinity between
>>>>> data and delete manifests. While the current separation of data and
>>>>> delete manifests in Iceberg is valuable for avoiding data file rewrites
>>>>> (and stats updates) when deletes change, it does necessitate a join
>>>>> operation during reads. I'd be keen to discuss approaches that could
>>>>> potentially reduce this read-side cost while retaining the benefits of
>>>>> separate manifests.
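The read-side join mentioned above can be pictured roughly as follows. This is a simplified sketch, assuming that delete files are matched to data files by partition value alone (real scan planning considers more than that); `plan_scan` and the tuple layout are hypothetical names for illustration.

```python
from collections import defaultdict


def plan_scan(data_entries, delete_entries):
    """Pair each data file with the delete files from its partition.

    Both inputs are lists of (partition, path) tuples, standing in for the
    entries read from data manifests and delete manifests respectively.
    """
    # Build side of a hash join: index delete files by partition.
    deletes_by_partition = defaultdict(list)
    for partition, delete_path in delete_entries:
        deletes_by_partition[partition].append(delete_path)

    # Probe side: look up the deletes that apply to each data file.
    return [
        (data_path, deletes_by_partition.get(partition, []))
        for partition, data_path in data_entries
    ]
```

Because the two kinds of entries live in separate manifests, both sides have to be read before any data file can be planned, which is the cost the paragraph above refers to.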
>>>>>
>>>>> Best,
>>>>> Anoop
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <
>>>>> sidhujagde...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I am new to the Iceberg community but would love to participate in
>>>>>> these discussions to reduce the number of file writes, especially for
>>>>>> small writes/commits.
>>>>>>
>>>>>> Thank you!
>>>>>> -Jagdeep
>>>>>>
>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada
>>>>>> <amantriprag...@apple.com.invalid> wrote:
>>>>>>
>>>>>>> We have been hitting all the metadata problems you mentioned, Ryan.
>>>>>>> I’m on-board to help however I can to improve this area.
>>>>>>>
>>>>>>>
>>>>>>> ~ Anurag Mantripragada
>>>>>>>
>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng
>>>>>>> <hua...@apple.com.INVALID> wrote:
>>>>>>>
>>>>>>> I am interested in this idea and looking forward to collaboration.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Huang-Hsiang
>>>>>>>
>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <nmk...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am interested in contributing to this effort.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Namratha
>>>>>>>
>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <2am...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for kicking this thread off Ryan, I'm interested in helping
>>>>>>>> out here! I've been working on a proposal in this area and it would be
>>>>>>>> great to collaborate with different folks and exchange ideas here,
>>>>>>>> since I think a lot of people are interested in solving this problem.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Like Russell’s recent note, I’m starting a thread to connect those
>>>>>>>>> of us that are interested in the idea of changing Iceberg’s metadata
>>>>>>>>> in v4 so that in most cases committing a change only requires writing
>>>>>>>>> one additional metadata file.
>>>>>>>>>
>>>>>>>>> *Idea: One-file commits*
>>>>>>>>>
>>>>>>>>> The current Iceberg metadata structure requires writing at least
>>>>>>>>> one manifest and a new manifest list to produce a new snapshot. The
>>>>>>>>> goal of this work is to allow more flexibility by allowing the
>>>>>>>>> manifest list layer to store data and delete files. As a result, only
>>>>>>>>> one file write would be needed before committing the new snapshot. In
>>>>>>>>> addition, this work will also try to explore:
>>>>>>>>>
>>>>>>>>>    - Avoiding small manifests that must be read in parallel and
>>>>>>>>>    later compacted (metadata maintenance changes)
>>>>>>>>>    - Extending metadata skipping to use aggregated column ranges
>>>>>>>>>    that are compatible with geospatial data (manifest metadata)
>>>>>>>>>    - Using soft deletes to avoid rewriting existing manifests
>>>>>>>>>    (metadata DVs)
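As a rough illustration of skipping with aggregated column ranges, the sketch below assumes a single numeric column with inclusive (low, high) bounds per file. The function names are hypothetical, and the thread does not say how ranges would be represented for geospatial data, so that case is not modeled here.

```python
def aggregate_range(file_ranges):
    """Fold per-file (low, high) bounds into one range for the whole manifest."""
    lows, highs = zip(*file_ranges)
    return min(lows), max(highs)


def may_match(aggregated, query):
    """Return False only when the manifest cannot contain matching rows."""
    low, high = aggregated
    query_low, query_high = query
    # Two inclusive ranges overlap when each one starts before the other ends.
    return low <= query_high and query_low <= high
```

A reader holding only the aggregated range can decide whether to open the manifest at all: with per-file ranges (10, 20), (15, 40) and (5, 12) the aggregate is (5, 40), so a filter on values between 41 and 50 skips the manifest without reading any of its entries.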
>>>>>>>>>
>>>>>>>>> If you’re interested in these problems, please reply!
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
