Following up here to summarize the discussion in the sync:

1. Generally, we want all attributes modelled as first-class entities in metadata. The main question, which was a little hard to answer, is how to satisfy use cases that are important to some users of Iceberg but that the community does not deem to have wide enough utility to add to the spec. A few examples were discussed in the doc. I'll follow up on these in separate threads about possibly adding them as metadata.
2. There was a general concern that key-value attributes could lead to collusion or to a shadow specification emerging. In other projects that have key-value pairs, this has generally not been an issue.

Thanks,
Micah

Link to video:
https://drive.google.com/file/d/17DrvLm-1eUY8gCjctWnGgYICj5dphzpn/view?usp=sharing
(Note: due to my company's security settings, this will only be available for ~8 hours. If we want to archive it more permanently, could someone with access to the YouTube channel copy it over there?)

On Fri, Mar 27, 2026 at 1:43 PM Russell Spitzer <[email protected]> wrote:

> I'm also leaning towards not adding this to the Spec. The more I think
> about it, the more it feels like it will just be a way to "fork" Iceberg
> with vendor-specific functionality. If someone wants to do that, they can
> always just add fields to the metadata they generate, but I'm not sure we
> should explicitly bless it.
>
> The more I think about the copy behavior, the less I like the idea of
> having an "outside definition" of the field. If we always drop the field,
> then what's the point of having it in the spec? If we do copy it, how can
> we assure the copy won't invalidate the value stored? Just feels like we
> aren't really getting anything out of this change.
>
> On Thu, Mar 26, 2026 at 6:14 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Prashant,
>> Unfortunately, I have conflicts on Wednesdays at that time for the
>> foreseeable future. Hopefully between the sync and mailing list we can
>> figure out a path forward. If anybody else has feedback please add it to
>> the Google doc or reply to the thread and I can address it.
>>
>> Thanks,
>> Micah
>>
>> On Thursday, March 26, 2026, Prashant Singh <[email protected]>
>> wrote:
>>
>>> Thank you for being flexible, Micah. How about we add this as an agenda
>>> item in the Iceberg community sync, which is just a day after at 9 pm? A lot of
>>> folks join and we will have better participation.
>>> and it seems like we would have time to talk, since I see the agenda is
>>> still open. If we can't conclude, we can have a dedicated sync for it.
>>>
>>> Best,
>>> Prashant Singh
>>>
>>> On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Thanks Kevin for accepting. Thanks for your feedback, Prashant. Since
>>>> you have been active in reviewing, I moved the event to Tuesday at a time that
>>>> you mentioned you would be available. Hopefully this doesn't exclude
>>>> anybody else who wants to join the conversation.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks for bumping this thread, Micah, and thank you for all the work!
>>>>> I missed this thread completely, apologies for that. I have so far been
>>>>> responding on the design docs (it would be nice to link the ML thread
>>>>> in the doc too).
>>>>>
>>>>> For the feedback, I am not supportive of this proposal, and I am
>>>>> looking forward to hearing from other community members on why, despite
>>>>> these severe cons, we should be doing it, especially given we have a clear,
>>>>> aligned path on how to introduce these in a backward-compatible way.
>>>>>
>>>>> Here are my reservations:
>>>>> 1/ While the proposal says the size can be limited (512B by default), it
>>>>> also says the limit is configurable. This can severely impact the number
>>>>> of entries we can have in a manifest file. We went through the whole
>>>>> exercise of whether we should have inline manifest DVs or not, and based
>>>>> on the tradeoffs we concluded one over the other. Giving this much size,
>>>>> in the worst case per data file inside the manifest, can severely impact
>>>>> the query planning time and query execution cost (more IO) of Iceberg
>>>>> readers, which may be different from whoever produced the Iceberg data set.
>>>>> 2/ It works on the assumption that we need a spec version bump to add
>>>>> new fields, which I think is not completely true: we added things like
>>>>> partition stats / the statistics field as optional. I don't understand why
>>>>> we can't do the same, especially for things like schema_id and footer_size,
>>>>> which are mentioned as motivation. I think the community was pretty
>>>>> aligned on having schema_id as an optional field, to keep writer backward
>>>>> compatibility, with all new writers taking the benefit of this [1].
>>>>> 3/ One of the stated motivations is to support vendors' proprietary
>>>>> metadata for their proprietary clustering algorithms. To me this looks
>>>>> like a way to work around the spec and let the Iceberg metadata layout
>>>>> carry information that doesn't mean anything to the Iceberg ecosystem
>>>>> and can compromise interoperability.
>>>>> Also think of a case where Vendor A starts producing something in
>>>>> partnership with Vendor B and, to make things worse, encrypts it so that
>>>>> Vendor C, who is not in this partnership, cannot see it. IMHO we should
>>>>> not open up new ways that hurt interop.
>>>>>
>>>>> I also want to thank you for proposing the meeting. Unfortunately the
>>>>> proposed time doesn't work for me (I have a conflicting meeting), so please
>>>>> feel free to proceed without me; I can watch the recording later.
>>>>> As far as my support is concerned, I look forward to answers that strongly
>>>>> support this use case and explain why we should be OK accepting these cons,
>>>>> given we already had a well-thought-out path to move forward.
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/4898
>>>>>
>>>>> Best,
>>>>> Prashant Singh
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I added/accepted on the dev calendar. Looking forward to it!
>>>>>> >>>>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> It seems like we might not have full alignment on this proposal, I >>>>>>> tentatively scheduled a sync for next Monday (added to the iceberg dev >>>>>>> events calendar). Please let me know if you are interested in joining >>>>>>> and >>>>>>> the time doesn't work for you (we can reschedule accordingly). >>>>>>> >>>>>>> Thanks, >>>>>>> Micah >>>>>>> >>>>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote: >>>>>>> > As an update I've made the proposal to add this field to the >>>>>>> Single file >>>>>>> > commits doc. >>>>>>> > >>>>>>> > Please let me know if there is any additional feedback. >>>>>>> > >>>>>>> > Thanks, >>>>>>> > Micah >>>>>>> > >>>>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield < >>>>>>> [email protected]> >>>>>>> > wrote: >>>>>>> > >>>>>>> > > Thanks Manu, that is the right doc. >>>>>>> > > >>>>>>> > > As an update, I've incorporated feedback from the community to >>>>>>> the >>>>>>> > > document: >>>>>>> > > >>>>>>> > > At a high level the changes are: >>>>>>> > > - Renamed the field from "tags" to "attributes" >>>>>>> > > - Clarified limits on attributes should only be enforced for new >>>>>>> data. >>>>>>> > > Existing tags must always be carried through. >>>>>>> > > - Added more details on enforcing size of tags. >>>>>>> > > >>>>>>> > > Are there any objections to folding the proposal into the V4 >>>>>>> metadata >>>>>>> > > proposal? Again, the reasons for doing so are mostly around >>>>>>> ensuring >>>>>>> > > consistent field numbering and making the spec update easier. >>>>>>> > > >>>>>>> > > If people want further discussion on this I'd be happy to >>>>>>> discuss at the >>>>>>> > > next V4 metadata sync or create a one-off meeting. Please let >>>>>>> me know. 
>>>>>>> > > >>>>>>> > > Thanks, >>>>>>> > > Micah >>>>>>> > > >>>>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang < >>>>>>> [email protected]> wrote: >>>>>>> > > >>>>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg >>>>>>> Single File >>>>>>> > >> Commits) ? >>>>>>> > >> I think you are referring to >>>>>>> > >> >>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz >>>>>>> > >> >>>>>>> > >> Best, >>>>>>> > >> Manu >>>>>>> > >> >>>>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield < >>>>>>> [email protected]> >>>>>>> > >> wrote: >>>>>>> > >> >>>>>>> > >>> Happy new year everyone, I just wanted to bump this thread >>>>>>> (most >>>>>>> > >>> discussion has been happening on the doc [1]) in case it was >>>>>>> missed over >>>>>>> > >>> the holidays. >>>>>>> > >>> >>>>>>> > >>> Thanks, >>>>>>> > >>> Micah >>>>>>> > >>> >>>>>>> > >>> [1] >>>>>>> > >>> >>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw >>>>>>> > >>> >>>>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield < >>>>>>> [email protected]> >>>>>>> > >>> wrote: >>>>>>> > >>> >>>>>>> > >>>> Sounds good, will wait until next year. >>>>>>> > >>>> >>>>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu < >>>>>>> [email protected]> wrote: >>>>>>> > >>>> >>>>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we >>>>>>> extend >>>>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new >>>>>>> year? >>>>>>> > >>>>> >>>>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield < >>>>>>> [email protected]> >>>>>>> > >>>>> wrote: >>>>>>> > >>>>> >>>>>>> > >>>>>> > I have no problem with adding this discussion to the >>>>>>> single file >>>>>>> > >>>>>> work, but I'm not sure that would speed it up? Seems like >>>>>>> this is a pretty >>>>>>> > >>>>>> independent addition to the metadata layout? 
>>>>>>> > >>>>>> >>>>>>> > >>>>>> Yes, it is fairly independent. The main reason I wanted to >>>>>>> > >>>>>> consolidate in the doc, it appears there is a bit of >>>>>>> metadata >>>>>>> > >>>>>> re-arrangement and new fields. I wanted to make sure that: >>>>>>> > >>>>>> >>>>>>> > >>>>>> 1. We avoid field ID conflicts. >>>>>>> > >>>>>> 2. When writing up the final spec changes it is easy to >>>>>>> manage and >>>>>>> > >>>>>> not create a dependency one way or another between the two >>>>>>> of these. >>>>>>> > >>>>>> >>>>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a >>>>>>> separate >>>>>>> > >>>>>> piece of work. >>>>>>> > >>>>>> >>>>>>> > >>>>>> Cheers, >>>>>>> > >>>>>> Micah >>>>>>> > >>>>>> >>>>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer < >>>>>>> > >>>>>> [email protected]> wrote: >>>>>>> > >>>>>> >>>>>>> > >>>>>>> I have no problem with adding this discussion to the >>>>>>> single file >>>>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like >>>>>>> this is a pretty >>>>>>> > >>>>>>> independent addition to the metadata layout? >>>>>>> > >>>>>>> >>>>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield < >>>>>>> > >>>>>>> [email protected]> wrote: >>>>>>> > >>>>>>> >>>>>>> > >>>>>>>> Thanks for the clarification, Micah! I want to explicitly >>>>>>> call out >>>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags >>>>>>> must be strictly >>>>>>> > >>>>>>>>> optional and never required for correctness or basic >>>>>>> functionality. Engines >>>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags >>>>>>> without breaking reads >>>>>>> > >>>>>>>>> or writes, with the only possible impact being >>>>>>> suboptimal behavior (e.g., >>>>>>> > >>>>>>>>> extra I/O), as you described. >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> 100% I will also add this summary to the bottom of the >>>>>>> requirements >>>>>>> > >>>>>>>> section. 
>>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack >>>>>>> > >>>>>>>> thereof), it does not seem like there are strong >>>>>>> objections to adding this >>>>>>> > >>>>>>>> for V4. Prashant seemed to maybe have concerns, so I'd >>>>>>> like to understand >>>>>>> > >>>>>>>> if they are blockers. >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> If there isn't additional feedback by the end of next >>>>>>> week, I'd >>>>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with >>>>>>> the single file >>>>>>> > >>>>>>>> improvement work, which has already reorganized the >>>>>>> metadata schema [1]. >>>>>>> > >>>>>>>> Please let me know if there is a different process. >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> Thanks, >>>>>>> > >>>>>>>> Micah >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> [1] >>>>>>> > >>>>>>>> >>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu < >>>>>>> [email protected]> >>>>>>> > >>>>>>>> wrote: >>>>>>> > >>>>>>>> >>>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to >>>>>>> explicitly call out >>>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags >>>>>>> must be strictly >>>>>>> > >>>>>>>>> optional and never required for correctness or basic >>>>>>> functionality. Engines >>>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags >>>>>>> without breaking reads >>>>>>> > >>>>>>>>> or writes, with the only possible impact being >>>>>>> suboptimal behavior (e.g., >>>>>>> > >>>>>>>>> extra I/O), as you described. >>>>>>> > >>>>>>>>> >>>>>>> > >>>>>>>>> As long as this constraint is clearly stated and >>>>>>> enforced, the >>>>>>> > >>>>>>>>> trade-off feels reasonable to me. 
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> Yufei
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>>> > >>>>>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>>> Hi Yufei,
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>>>>> > >>>>>>>>>> engine. But it could cause suboptimal operations. For instance, one
>>>>>>> > >>>>>>>>>> example I brought up in the docs is using tags to cache the Parquet
>>>>>>> > >>>>>>>>>> footer size, to make sure the footer is fetched in 1 I/O.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> In this case the following would occur.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> 1. Engine 1 does a write to file 1 and records its footer size
>>>>>>> > >>>>>>>>>> in tags.
>>>>>>> > >>>>>>>>>> 2. Engine 2 does a rewrite/compaction and produces file 2
>>>>>>> > >>>>>>>>>> without tags.
>>>>>>> > >>>>>>>>>> 3. Engine 1 then tries to read file 2. The tag for footer
>>>>>>> > >>>>>>>>>> length is missing, so it falls back to reading a reasonable number
>>>>>>> > >>>>>>>>>> of bytes from the end of the Parquet file, hoping the entire footer
>>>>>>> > >>>>>>>>>> is retrieved (and if it isn't, a second I/O is necessary).
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>>>>> > >>>>>>>>>> yield a sub-optimally clustered table, or perhaps redundant clustering
>>>>>>> > >>>>>>>>>> operations, but it shouldn't break anything. This is no worse than the
>>>>>>> > >>>>>>>>>> case today, though, if engine 1 and engine 2 have different clustering
>>>>>>> > >>>>>>>>>> algorithms and they are being run in interleaved fashion on the same
>>>>>>> > >>>>>>>>>> table. In this case it is highly likely that some amount of duplicate
>>>>>>> > >>>>>>>>>> compaction is happening.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> Thanks,
>>>>>>> > >>>>>>>>>> Micah
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>>>>> > >>>>>>>>>> wrote:
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>> > >>>>>>>>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>>>>> > >>>>>>>>>>> engine remove the tag, and break the engine relying on it?
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> Yufei
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>>> > >>>>>>>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>>>>> > >>>>>>>>>>>> in V4 metadata [2]. More details are in the document, but the intent
>>>>>>> > >>>>>>>>>>>> is to allow engines to store optional metadata associated with these
>>>>>>> > >>>>>>>>>>>> files:
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> 1. The proposed field is optional and cannot be used for
>>>>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>>>>> > >>>>>>>>>>>> 2. It also proposes guard-rails for not letting tags cause
>>>>>>> > >>>>>>>>>>>> metadata bloat.
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Thanks,
>>>>>>> > >>>>>>>>>>>> Micah
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>> > >>>>>>>>>>>> [2]
>>>>>>> > >>>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>
>>>>>>
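
For readers of the archive, the footer-size example from the Dec 15 message in this thread (an engine records the Parquet footer size as an optional attribute, and a reader falls back to a speculative tail read when it is missing) could look roughly like the sketch below. This is only an illustration under stated assumptions, not code from the proposal or from any Iceberg implementation: the attribute key `footer-size`, the `read_footer` helper, and the 64 KiB default guess are all hypothetical. The only fixed fact it relies on is the Parquet trailer layout (a 4-byte little-endian footer length followed by the magic bytes `PAR1`).

```python
import io
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8                  # 4-byte little-endian footer length + 4-byte magic
FOOTER_SIZE_KEY = "footer-size"  # hypothetical attribute key, not defined by any spec


def read_range(f, offset, length):
    """One positioned read; stands in for a single object-store I/O."""
    f.seek(offset)
    return f.read(length)


def read_footer(f, file_size, attributes=None, speculative_tail=64 * 1024):
    """Return (footer_bytes, number_of_reads).

    The attribute is only a hint: if it is missing or stale, the reader
    still gets the correct footer at the cost of a second read.
    """
    if file_size < TRAILER_LEN:
        raise ValueError("file too small to be a Parquet file")
    hint = (attributes or {}).get(FOOTER_SIZE_KEY)
    if hint is not None:
        tail_len = min(file_size, int(hint) + TRAILER_LEN)
    else:
        tail_len = min(file_size, speculative_tail)  # guess a reasonable tail size

    tail = read_range(f, file_size - tail_len, tail_len)
    reads = 1
    if tail[-4:] != MAGIC:
        raise ValueError("missing Parquet magic bytes")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")
    if needed > tail_len:
        # The guess (or a stale hint) was too small: pay for a second I/O.
        tail = read_range(f, file_size - needed, needed)
        reads += 1
    footer = tail[len(tail) - needed:len(tail) - TRAILER_LEN]
    return footer, reads


if __name__ == "__main__":
    # Build a fake file with a Parquet-style trailer (the footer is opaque bytes here).
    footer = b"F" * 1000
    blob = MAGIC + b"d" * 5000 + footer + struct.pack("<I", len(footer)) + MAGIC
    f = io.BytesIO(blob)
    print(read_footer(f, len(blob), {FOOTER_SIZE_KEY: "1000"})[1])  # hint present
    print(read_footer(f, len(blob), None, speculative_tail=256)[1])  # hint dropped
```

Because the length stored in the file's own trailer is always checked, the attribute never affects correctness; dropping it (as in step 2 of the example) only changes how many reads the footer fetch may need.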
