My takeaway from the conversation is also that we don't need row-level column updates; manifest DVs can be used for row-level updates instead. Basically, a file (manifest or data) can be updated via (1) a delete vector plus the updated rows in a new file, or (2) a column file overlay. Depending on the percentage of modified rows, engines can choose which way to go.
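To make that choice concrete, here is a minimal sketch of how an engine might pick between the two paths. The planner class and the 30% cutoff are illustrative assumptions, not anything from the proposal; a real engine would also weigh file sizes, stats, and workload:

```java
// Illustrative only: neither this class nor the threshold comes from the
// proposal; it just shows the shape of the decision.
enum UpdateStrategy { DV_PLUS_NEW_FILE, COLUMN_FILE_OVERLAY }

final class UpdatePlanner {
  // Hypothetical cutoff: above this fraction of modified rows, a column
  // file overlay is assumed to beat masking rows with a delete vector.
  private static final double OVERLAY_THRESHOLD = 0.30;

  static UpdateStrategy choose(long modifiedRows, long totalRows) {
    double fraction = (double) modifiedRows / totalRows;
    // (1) few rows changed: mask them with a DV and write the updated rows
    //     to a new file; (2) many rows changed: overlay the column(s).
    return fraction < OVERLAY_THRESHOLD
        ? UpdateStrategy.DV_PLUS_NEW_FILE
        : UpdateStrategy.COLUMN_FILE_OVERLAY;
  }
}
```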
On Tue, Mar 3, 2026 at 6:24 AM Gábor Kaszab <[email protected]> wrote:

> Thanks for the summary, Micah! I tried to watch the recording linked to the calendar event, but apparently I don't have permission to do so. Not sure about others.

> So if I'm not mistaken, one way to reduce the write cost of an UPDATE for colocated DVs is to use column updates. As I see it, there was some agreement that row-level partial column updates aren't desired, and we aim for at least file-level column updates. This is very useful information for the other conversation <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny> going on for the column update proposal. We can bring this up on the column update sync tomorrow, but I'm wondering if the consensus on avoiding row-level column updates is something we can incorporate into the column update proposal too, or if it's something still up for debate.

> Best Regards,
> Gabor

> On Wed, Feb 25, 2026 at 22:30 Micah Kornfield <[email protected]> wrote:

>> Just wanted to summarize my main takeaways of Monday's sync.

>> The approach will always colocate DVs with the data files (i.e. every data file row in a manifest has an optional DV reference). This implies that there is not a separate "deletion manifest". Rather, in V4 all manifests are "combined", with data files and DVs colocated.

>> Write amplification is avoided in two ways:
>> 1. For small updates we will need to carry through metadata statistics (and other relevant data file fields) in memory (rescanning these is likely too expensive). Once updates are available, they will be written out to a new manifest (either root or leaf), using metadata DVs to remove the old rows.
>> 2. For larger updates we will only carry through the DV update parts in memory and use column-level updates to replace existing DVs (this would require rescanning the DV columns for any updated manifest to merge with the updated DVs in memory, and then writing out the column update). The consensus on the call is that we didn't want to support partial column updates (a.k.a. merge-on-read column updates).

>> The idea is that engines would decide which path to follow based on the number of affected files.

>> To help understand the implications of the new proposal, I put together a quick spreadsheet [1] to analyze trade-offs between separate deletion manifests and the new approach under scenarios 1 and 2. This represents the worst-case scenario, where file updates are uniformly distributed across a single update operation. It does not account for repeated writes (e.g. ongoing compaction). My main takeaway is that keeping at most 1 affiliated DV separate might still help (akin to a merge-on-read column update), but maybe not enough relative to other parts of the system (e.g. the churn on data files) to justify the complexity.

>> Hope this is helpful.

>> Micah

>> [1] https://docs.google.com/spreadsheets/d/1klZQxV7ST2C-p9LTMmai_5rtFiyupj6jSLRPRkdI-u8/edit?gid=0#gid=0

>> On Thu, Feb 19, 2026 at 3:52 PM Amogh Jahagirdar <[email protected]> wrote:

>>> Hey folks, I've set up an additional initial discussion on DVs for Monday. This topic is fairly complex and there is also now a free calendar slot. I think it'd be helpful for us to first make sure we're all on the same page in terms of what the approach proposed by Anton earlier in the thread means, and the high-level mechanics. I should also have more to share on the doc about how the entry structure and change detection could look in this approach. Then on Thursday we can get into more details and targeted points of discussion on this topic.

>>> Thanks,
>>> Amogh Jahagirdar

>>> On Tue, Feb 17, 2026 at 9:27 PM Amogh Jahagirdar <[email protected]> wrote:

>>>> Thanks Steven! I've set up some time next Thursday for the community to discuss this. We're also looking at how the content entry would look in a combined DV with potential column updates for DV changes, and how change detection could look in this approach. I should have more to share on this by the time of the community discussion next week.
>>>> We should also consider potential root churn and memory consumption stemming from expected root entry inflation due to a combined data file + DV entry with possible column updates for certain DV workloads; though at least for memory consumption of stats being held after planning, that arguably is an implementation problem for certain integrations.

>>>> Thanks,
>>>> Amogh Jahagirdar

>>>> On Fri, Feb 13, 2026 at 10:58 AM Steven Wu <[email protected]> wrote:

>>>>> I wrote up some analysis with back-of-the-envelope calculations about the column update approach for DV colocation. It mainly concerns the 2nd use case: deleting a large number of rows from a small number of files.

>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.gvdulzy486n7

>>>>> On Wed, Feb 4, 2026 at 1:02 AM Péter Váry <[email protected]> wrote:

>>>>>> I fully agree with Anton and Steven that we need benchmarks before choosing any direction.

>>>>>> I ran some preliminary column-stitching benchmarks last summer:

>>>>>> - Results are available in the doc: https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
>>>>>> - Code is here: https://github.com/apache/iceberg/pull/13306

>>>>>> I've summarized the most relevant results at the end of this email. They show roughly a 10% slowdown on the read path with column stitching in similar scenarios when using local SSDs. I expect that in real deployments the metadata read cost will mostly be driven by blob I/O (assuming no caching). If blob access becomes the dominant factor in read latency, multithreaded fetching should be able to absorb the overhead introduced by column stitching, resulting in latency similar to the single-file layout (unless IO is already the bottleneck).

>>>>>> We should definitely rerun the benchmarks once we have a clearer understanding of the intended usage patterns.
>>>>>> Thanks,
>>>>>> Peter

>>>>>> The relevant(ish) results are for 100 columns, with 2 families of 50 columns each, and local reads:

>>>>>> The base is:
>>>>>> MultiThreadedParquetBenchmark.read 100 0 false ss 20 3.739 ± 0.096 s/op

>>>>>> The read for single threaded:
>>>>>> MultiThreadedParquetBenchmark.read 100 2 false ss 20 4.036 ± 0.082 s/op

>>>>>> The read for multi threaded:
>>>>>> MultiThreadedParquetBenchmark.read 100 2 true ss 20 4.063 ± 0.080 s/op

>>>>>> On Tue, Feb 3, 2026 at 23:27 Steven Wu <[email protected]> wrote:

>>>>>>> I agree with Anton in this <https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o/edit?disco=AAAByzDx21w> comment thread that we probably need to run benchmarks for a few common scenarios to guide this decision. We need to write down detailed plans for those scenarios and what we are measuring. Also, ideally we want to measure using the V4 metadata structure (like Parquet manifest file, column stats structs, adaptive tree). There are PoC PRs available for column stats, Parquet manifest, and root manifest. It would probably be tricky to piece them together to run the benchmark considering the PoC status. We also need the column stitching capability on the read path to test the column file approach.

>>>>>>> On Tue, Feb 3, 2026 at 1:53 PM Anoop Johnson <[email protected]> wrote:

>>>>>>>> I'm in favor of co-located DV metadata with column file override and not doing affiliated/unaffiliated delete manifests. This is conceptually similar to strictly affiliated delete manifests with positional joins, and will halve the number of I/Os when there is no DV column override. It is simpler to implement and will speed up reads.

>>>>>>>> Unaffiliated DV manifests are flexible for writers. They reduce the chance of physical conflicts when there are concurrent large/random deletes that change DVs on different files in the same manifest. But the flexibility comes at a read-time cost. If the number of unaffiliated DVs exceeds a threshold, it could cause driver OOMs or require a distributed join to pair up DVs with data files. With colocated metadata, manifest DVs can reduce the chance of conflicts up to a certain write size.

>>>>>>>> I assume we will still support unaffiliated manifests for equality deletes, but perhaps we can restrict it to just equality deletes.

>>>>>>>> -Anoop

>>>>>>>> On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi <[email protected]> wrote:

>>>>>>>>> I added the approach with column files to the doc.

>>>>>>>>> To sum up, separate data and delete manifests with affinity would perform somewhat on par with co-located DV metadata (a.k.a. direct assignment) if we add support for column files when we need to replace most or all DVs (use case 1). That said, the support for direct assignment with in-line metadata DVs can help us avoid unaffiliated delete manifests when we need to replace a few DVs (use case 2).

>>>>>>>>> So the key question is whether we want to allow unaffiliated delete manifests with DVs... If we don't, then we would likely want to have co-located DV metadata, and we must support efficient column updates so as not to regress compared to V2 and V3 for large MERGE jobs that modify a small set of records for most files.

>>>>>>>>> On Mon, Feb 2, 2026 at 13:20 Anton Okolnychyi <[email protected]> wrote:

>>>>>>>>>> Anoop, correct, if we keep data and delete manifests separate, there is a better way to combine the entries and we should NOT rely on the referenced data file path. Reconciling by implicit position will reduce the size of the DV entry (no need to store the referenced data file path) and will improve the planning performance (no equals/hashCode on the path).

>>>>>>>>>> Steven, I agree. Most notes in the doc pre-date discussions we had on column updates. You are right: given that we are gravitating towards a native way to handle column updates, it seems logical to use the same approach for replacing DVs, since they're essentially column updates. Let me add one more approach to the doc based on what Anurag and Peter have so far.

>>>>>>>>>> On Sun, Feb 1, 2026 at 20:59 Steven Wu <[email protected]> wrote:

>>>>>>>>>>> Anton, thanks for raising this. I agree this deserves another look. I added a comment in your doc that we can potentially apply the column update proposal for data file updates to the manifest file updates as well, to colocate the data DVs and data manifest files. Data DVs can be a separate column in the data manifest file and updated separately in a column file. This is the same as the coalesced positional join that Anoop mentioned.

>>>>>>>>>>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> wrote:

>>>>>>>>>>>> Thank you for raising this, Anton. I had a similar observation while prototyping <https://github.com/apache/iceberg/pull/14533> the adaptive metadata tree. The overhead of doing a path-based hash join of a data manifest with the affiliated delete manifest is high: my estimate was that the join adds about 5-10% overhead. The hash table build/probe alone takes about 5 ms for manifests with 25K entries. There are engines that can do vectorized hash joins that can lower this, but the overhead and complexity of a SIMD-friendly hash join is non-trivial.

>>>>>>>>>>>> An alternative to relying on the external file feature in Parquet is to make affiliated manifests order-preserving: i.e., DVs in an affiliated delete manifest must appear in the same position as the corresponding data file in the data manifest the delete manifest is affiliated to. If a data file does not have a DV, the DV manifest must store a NULL. This would allow us to do positional joins, which are much faster. If we wanted, we could even have multiple affiliated DV manifests for a data manifest and the reader would do a COALESCED positional join (i.e. pick the first non-null value as the DV). It puts the sorting responsibility on the writers, but it might be a reasonable tradeoff.
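For illustration, a sketch of the COALESCED positional join described above: every affiliated DV manifest is positionally aligned with the data manifest (NULL where a file has no DV), so the reader takes the first non-null DV per position rather than building a hash table. The types and names here are made up for the example:

```java
import java.util.List;

// Sketch only: byte[] stands in for a serialized DV; each element of
// dvManifests is positionally aligned with dataFiles, ordered newest-first,
// and holds null where a data file has no DV.
final class CoalescedPositionalJoin {
  static byte[][] resolveDvs(List<String> dataFiles, List<byte[][]> dvManifests) {
    byte[][] resolved = new byte[dataFiles.size()][];
    for (int pos = 0; pos < dataFiles.size(); pos++) {
      for (byte[][] dvManifest : dvManifests) {
        if (dvManifest[pos] != null) {  // first non-null value wins
          resolved[pos] = dvManifest[pos];
          break;
        }
      }
    }
    return resolved;  // resolved[pos] pairs with dataFiles.get(pos)
  }
}
```

Unlike the path-based hash join, this is a single pass with no hash table build or probe, and no referenced data file paths stored in the DV entries.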
>>>>>>>>>>>> Also, the options don't necessarily have to be mutually exclusive. We could still allow affiliated DVs to be "folded" into the data manifest (e.g. by background optimization jobs or the writer itself). That might be the optimal choice for read-heavy tables because it will halve the number of I/Os readers have to make.

>>>>>>>>>>>> Best,
>>>>>>>>>>>> Anoop

>>>>>>>>>>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi <[email protected]> wrote:

>>>>>>>>>>>>> I had a chance to catch up on some of the V4 discussions. Given that we are getting rid of the manifest list and switching to Parquet, I wanted to re-evaluate the possibility of direct DV assignment that we discarded in V3 to avoid regressions. I have put together my thoughts in a doc [1].

>>>>>>>>>>>>> TL;DR:

>>>>>>>>>>>>> - I think the current V4 proposal that keeps data and delete manifests separate but introduces affinity is a solid choice for cases when we need to replace DVs in many / most files. I outlined an approach with column-split Parquet files, but it doesn't improve the performance and takes a dependency on a portion of the Parquet spec that is not really implemented.
>>>>>>>>>>>>> - Pushing unaffiliated DVs directly into the root to replace a small set of DVs is going to be fast on write, but does require resolving where those DVs apply at read time. Using inline metadata DVs with column-split Parquet files is a little more promising in this case, as it allows us to avoid unaffiliated DVs. That said, it again relies on something Parquet doesn't implement right now, requires changing maintenance operations, and yields minimal benefits.

>>>>>>>>>>>>> All in all, the V4 proposal seems like a strict improvement over V3, but I insist that we reconsider usage of the referenced data file path when resolving DVs to data files.

>>>>>>>>>>>>> [1] - https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o

>>>>>>>>>>>>> - Anton

>>>>>>>>>>>>> On Sat, Nov 22, 2025 at 13:37 Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>> Hey all,

>>>>>>>>>>>>>> Here is the meeting recording <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> and generated meeting summary <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. Thanks all for attending yesterday!

>>>>>>>>>>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>> Hey folks,

>>>>>>>>>>>>>>> I was out for some time, but set up a sync for tomorrow at 9am PST. For this discussion, I do think it would be great to focus on the manifest DV representation, factoring in analyses on bitmap representation storage footprints, and the entry structure considering how we want to approach change detection. If there are other topics that people want to highlight, please do bring those up as well!

>>>>>>>>>>>>>>> I also recognize that this is a bit short-term scheduling, so please do reach out to me if this time is difficult to work with; next week is the Thanksgiving holidays here, and since people would be travelling/out I figured I'd try to schedule before then.

>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>> Hey folks,

>>>>>>>>>>>>>>>> Sorry for the delay, here's the recording link <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> from last week's discussion.

>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry <[email protected]> wrote:

>>>>>>>>>>>>>>>>> Same here.
>>>>>>>>>>>>>>>>> Please record if you can.
>>>>>>>>>>>>>>>>> Thanks, Peter

>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2025 at 17:39 Fokko Driesprong <[email protected]> wrote:

>>>>>>>>>>>>>>>>>> Hey Amogh,

>>>>>>>>>>>>>>>>>> Thanks for the write-up. Unfortunately, I won't be able to attend. Will it be recorded? Thanks!

>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>> Fokko

>>>>>>>>>>>>>>>>>> On Tue, Oct 7, 2025 at 20:36 Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>> Hey all,

>>>>>>>>>>>>>>>>>>> I've set up time this Friday at 9am PST for another sync on single file commits. In terms of what would be great to focus on for the discussion:

>>>>>>>>>>>>>>>>>>> 1. Whether it makes sense or not to eliminate the tuple, and instead represent the tuple via lower/upper boundaries. As a reminder, one of the goals is to avoid tying a partition spec to a manifest; in the root we can have a mix of files spanning different partition specs, and even in leaf manifests avoiding this coupling can enable more desirable clustering of metadata.
>>>>>>>>>>>>>>>>>>> In the vast majority of cases, we could leverage the property that a file is effectively partitioned if the lower/upper for a given field is equal. The nuance here is with the particular case of identity-partitioned string/binary columns, which can be truncated in stats. One approach is to require that writers must not produce truncated stats for identity-partitioned columns. It's also important to keep in mind that all of this is just for the purpose of reconstructing the partition tuple, which is only required during equality delete matching. Another area we need to cover as part of this is exact bounds on stats. There are other options here as well, such as making all new equality deletes in V4 be global and instead match based on bounds, or keeping the tuple but having each tuple effectively based off a union schema of all partition specs. I am adding a separate appendix section outlining the span of options here and the different tradeoffs.
>>>>>>>>>>>>>>>>>>> Once we get this to a more conclusive state, I'll move a summarized version to the main doc.

>>>>>>>>>>>>>>>>>>> 2. @[email protected] <[email protected]> has updated the doc with a section <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> on how we can do change detection from the root in a variety of write scenarios. I've done a review of it, and it covers the cases I would expect. It'd be good for folks to take a look and please give feedback before we discuss. Thank you Steven for adding that section and all the diagrams.

>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
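A small sketch of the bound-equality property from point 1 above: a field can be treated as identity-partitioned within a file exactly when its exact (untruncated) lower and upper bounds coincide. The ColumnStat record and helper here are hypothetical, for illustration only:

```java
import java.nio.ByteBuffer;
import java.util.Optional;

// Hypothetical stat holder; bounds are assumed exact (not truncated).
record ColumnStat(ByteBuffer lowerBound, ByteBuffer upperBound) {}

final class PartitionFromStats {
  // Recovers an identity partition value iff the field is constant in the
  // file, i.e. its lower and upper bounds are equal.
  static Optional<ByteBuffer> identityValue(ColumnStat stat) {
    if (stat.lowerBound() != null && stat.lowerBound().equals(stat.upperBound())) {
      return Optional.of(stat.lowerBound());  // constant within the file
    }
    return Optional.empty();  // not constant, or bounds missing/truncated
  }
}
```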
>>>>>>>>>>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>> Hey folks, just following up from the discussion last Friday with a summary and some next steps:

>>>>>>>>>>>>>>>>>>>> 1.) For the various change detection cases, we concluded it's best just to go through those in an offline manner on the doc, since it's hard to verify all that correctness in a large meeting setting.
>>>>>>>>>>>>>>>>>>>> 2.) We mostly discussed eliminating the partition tuple. In the original proposal, I was mostly aiming for the ability to reconstruct the tuple from the stats for the purpose of equality delete matching (a file is partitioned if the lower and upper bounds are equal); there's some nuance in how we need to handle identity partition values, since for string/binary they cannot be truncated. Another potential option is to treat all equality deletes as effectively global and narrow their application based on the stats values. This may require defining tight bounds. I'm still collecting my thoughts on this one.

>>>>>>>>>>>>>>>>>>>> Thanks folks! Please also let me know if any of the following links are inaccessible for any reason.

>>>>>>>>>>>>>>>>>>>> Meeting recording link: https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view

>>>>>>>>>>>>>>>>>>>> Meeting summary: https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU

>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>> Update: I moved the discussion time to this Friday at 9 am PST since I found out that quite a few folks involved in the proposals will be out next week, and I know some folks will also be out the week after that.

>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Amogh J

>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>> Hey folks, sorry for the late follow-up here,

>>>>>>>>>>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording link of the previous discussion! I've set up another sync for next Tuesday 09/16 at 9am PST. This time I've set it up from my corporate email so we can get recordings and transcriptions (and I've made sure to keep the meeting invite open so we don't have to manually let people in).

>>>>>>>>>>>>>>>>>>>>>> In terms of next steps, areas which I think would be good to focus on for establishing consensus:

>>>>>>>>>>>>>>>>>>>>>> 1. How do we model the manifest entry structure so that changes to manifest DVs can be obtained easily from the root? There are a few options here; the most promising approach is to keep an additional DV which encodes, as a diff, the positions that have been removed from a leaf manifest.

>>>>>>>>>>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and establishing a unified table ID space, so that we can simplify how partition tuples may be represented via stats and also have a way in the future to store stats on any derived column. I have a short proposal <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> for this that probably still needs some tightening up on the expression modeling itself (and some prototyping), but the general idea for establishing a unified table ID space is covered. All feedback welcome!

>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last week's sync is available on YouTube. Here's the link: https://www.youtube.com/watch?v=uWm-p--8oVQ

>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> Kevin Liu

>>>>>>>>>>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>> Hey folks,

>>>>>>>>>>>>>>>>>>>>>>>> Just following up on this to give the community an update on where we're at and my proposed next steps.

>>>>>>>>>>>>>>>>>>>>>>>> I've been editing and merging the contents from our proposal into the proposal <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> from Russell and others. For any future comments on docs, please comment on the linked proposal. I've also marked it on our doc in red text so it's clear to redirect to the other proposal as a source of truth for comments.

>>>>>>>>>>>>>>>>>>>>>>>> In terms of next steps:

>>>>>>>>>>>>>>>>>>>>>>>> 1. An important design decision point is around inline manifest DVs, external manifest DVs, or enabling both. I'm working on measuring different approaches for representing the compressed DV representation, since that will inform how many entries can reasonably fit in a small root manifest; from that we can derive implications on different write patterns and determine the right approach for storing these manifest DVs.

>>>>>>>>>>>>>>>>>>>>>>>> 2. Another key point is around determining if/how we can reasonably enable V4 to represent changes in the root manifest, so that readers can effectively just infer file-level changes from the root.

>>>>>>>>>>>>>>>>>>>>>>>> 3. One of the aspects of the proposal is getting away from the partition tuple requirement in the root, which currently forces us to associate a partition spec with a manifest. These aspects can be modeled as essentially column stats, which gives a lot of flexibility in the organization of the manifest. There are important details around field ID spaces here which tie into how the stats are structured. What we're proposing here is to have a unified expression ID space that could also benefit us for storing things like virtual columns down the line. I go into this in the proposal, but I'm working on separating the appropriate parts so that the original proposal can mostly just focus on the organization of the content metadata tree and not how we want to solve this particular ID space problem.

>>>>>>>>>>>>>>>>>>>>>>>> 4. I'm planning on scheduling a recurring community sync starting next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this time will never work, I can certainly adjust. For some reason, I don't have the ability to add to the Iceberg Dev calendar, so I'll figure that out and update the thread when the event is scheduled.

>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>> I think this is a great way forward; starting out with this much parallel development shows that we have a lot of consensus already :)

>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>> Hey folks, just following up on this. It looks like our proposal and the proposal that @Russell Spitzer <[email protected]> shared are pretty aligned. I was just chatting with Russell about this, and we think it'd be best to combine both proposals and have a singular large effort on this. I can also set up a focused community discussion (similar to what we're doing on the other V4 proposals) on this starting sometime next week just to get things moving, if that works for people.

>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Russell,

>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop and I) have also been working on a proposal for an adaptive metadata tree structure as part of enabling more efficient one-file commits. From a read of the summary, it's great to see that we're thinking along the same lines about how to tackle this fundamental area!

>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is our proposal: https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0

>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey y'all!

>>>>>>>>>>>>>>>>>>>>>>>>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some of the thoughts we had on how one-file commits could work in Iceberg. This is pretty much just a high-level overview of the concepts we think we need and how Iceberg would behave. We haven't gone very far into the actual implementation and changes that would need to occur in the SDK to make this happen.

>>>>>>>>>>>>>>>>>>>>>>>>>>>> The high-level summary is:

>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Manifest lists are out
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Root manifests take their place
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - A root manifest can have data manifests, delete manifests, manifest delete vectors, data delete vectors and data files
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Manifest delete vectors allow for modifying a manifest without deleting it entirely
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Data files let you append without writing an intermediary manifest
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Having child data and delete manifests lets you still scale

>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please take a look if you like:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0

>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm excited to see what other proposals and ideas are floating around the community,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Russ
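As a point of reference for the manifest-delete-vector item in the summary above, a minimal sketch of how one might be applied on the read path: entries are soft-deleted by ordinal position, so a reader keeps only positions absent from the bitmap. This uses a plain 32-bit roaring bitmap (org.roaringbitmap) for brevity; the generic entry handling is illustrative, not an Iceberg API:

```java
import java.util.ArrayList;
import java.util.List;
import org.roaringbitmap.RoaringBitmap;

// Sketch: filter a manifest's entries through its manifest DV without
// rewriting the manifest file itself.
final class ManifestDvFilter {
  static <E> List<E> liveEntries(List<E> entries, RoaringBitmap manifestDv) {
    List<E> live = new ArrayList<>(entries.size());
    for (int pos = 0; pos < entries.size(); pos++) {
      if (!manifestDv.contains(pos)) {  // position not soft-deleted
        live.add(entries.get(pos));
      }
    }
    return live;
  }
}
```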
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Very excited about the idea!

>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm very interested in this initiative. Micah Kornfield and I presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg Summit, which leveraged Google infrastructure like Colossus for efficient appends.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This new proposal is particularly exciting because it offers significant advancements in commit latency and metadata storage footprint. Furthermore, a consistent manifest structure promises to simplify the design and codebase, which is a major benefit.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A related idea I've been exploring is having a loose affinity between data and delete manifests. While the current separation of data and delete manifests in Iceberg is valuable for avoiding data file rewrites (and stats updates) when deletes change, it does necessitate a join operation during reads. I'd be keen to discuss approaches that could potentially reduce this read-side cost while retaining the benefits of separate manifests.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anoop

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am new to the Iceberg community but would love to participate in these discussions to reduce the number of file writes, especially for small writes/commits.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Jagdeep

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We have been hitting all the metadata problems you mentioned, Ryan. I'm on board to help however I can to improve this area.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag Mantripragada

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am interested in this idea and looking forward to collaboration.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Huang-Hsiang

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am interested in contributing to this effort.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Namratha

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for kicking this thread off, Ryan, I'm interested in helping out here! I've been working on a proposal in this area and it would be great to collaborate with different folks and exchange ideas here, since I think a lot of people are interested in solving this problem.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Like Russell's recent note, I'm starting a thread to connect those of us that are interested in the idea of changing Iceberg's metadata in v4 so that in most cases committing a change only requires writing one additional metadata file.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> *Idea: One-file commits*

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at least one manifest and a new manifest list to produce a new snapshot. The goal of this work is to allow more flexibility by allowing the manifest list layer to store data and delete files. As a result, only one file write would be needed before committing the new snapshot. In addition, this work will also try to explore:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Avoiding small manifests that must be read in parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Extending metadata skipping to use aggregated column ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Using soft deletes to avoid rewriting existing manifests (metadata DVs)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you're interested in these problems, please reply!

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ryan

>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> John Zhuge
