Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Thu, 16 Apr 2026 12:03:17 -0700

Following up here to summarize the discussion in the sync:

1. Generally, we want all attributes modelled as first-class entities in
metadata.  The main question, which was a little hard to answer, is how
to satisfy use cases that are important to some users of Iceberg but
that the community doesn't deem to have wide enough utility to add to the
spec.  There were a few examples discussed in the doc.  I'll follow up on
these in separate threads about possibly adding them as metadata.


2.  There was a general concern that key-value attributes could lead to
collusion or a shadow specification emerging.  For other projects that
have key-value pairs, this has generally not been an issue.

Thanks,
Micah


Link to video:
https://drive.google.com/file/d/17DrvLm-1eUY8gCjctWnGgYICj5dphzpn/view?usp=sharing
(Note: due to my company's security settings this will only be available for
~8 hours. If we want to archive this more permanently, could someone with
access to the YouTube channel copy it over there?)

On Fri, Mar 27, 2026 at 1:43 PM Russell Spitzer <[email protected]>
wrote:

> I'm also leaning towards not adding this to the Spec. The more I think
> about it, the more it feels like it will
> just be a way to "fork" Iceberg with vendor specific functionality. If
> someone wants
> to do that, they can always just add fields to the metadata they generate,
> but I'm not sure we should explicitly bless it.
>
> The more I think about the copy behavior, the less I like the idea of
> having an "outside definition" of the field. If we always
> drop the field, then what's the point of having it in the spec? If we do
> copy it, how can we assure the copy won't invalidate
> the value stored? Just feels like we aren't really getting anything out of
> this change.
>
> On Thu, Mar 26, 2026 at 6:14 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Prashant,
>> Unfortunately, I have conflicts on Wednesdays for the foreseeable
>> future at that time.  Hopefully between the sync and mailing list we can
>> figure out a path forward.  If anybody else has feedback please add it to
>> the Google doc or reply to the thread and I can address it.
>>
>> Thanks,
>> Micah
>>
>> On Thursday, March 26, 2026, Prashant Singh <[email protected]>
>> wrote:
>>
>>> Thank you for being flexible, Micah. How about we add this as an agenda
>>> item for the Iceberg community sync, which is just a day later at 9 pm? A
>>> lot of folks join, so we would have better participation. It also seems
>>> like we would have time to talk, since I see the agenda is still open. If
>>> we can't conclude, we can have a dedicated sync for it.
>>>
>>> Best,
>>> Prashant Singh
>>>
>>> On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Thanks Kevin for accepting.  Thanks for your feedback Prashant, since
>>>> you have been active reviewing, I moved the event to Tuesday at a time that
>>>> you mentioned you would be available, hopefully this doesn't exclude
>>>> anybody else who wants to join the conversation.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks for bumping this thread, Micah, and thank you for all the work!
>>>>> I missed this thread completely, apologies for that. So far I have been
>>>>> responding on the design docs (it would be nice to link the ML to the doc too).
>>>>>
>>>>> As for feedback: I am not supportive of this proposal, and I look
>>>>> forward to hearing from other community members on why we should do it
>>>>> despite these severe cons, especially given that we have a clear, aligned
>>>>> path for introducing these fields in a backward-compatible way.
>>>>>
>>>>> Here are my reservations:
>>>>> 1/ While the proposal sets a default size limit of 512B, it also says
>>>>> the limit is configurable. This can severely impact the number of entries
>>>>> we can have in a manifest file. We went through the whole exercise of
>>>>> deciding whether we should have inline manifest DVs or not, and based on
>>>>> the tradeoffs we chose one over the other. Allowing this much space per
>>>>> data file in the worst case can severely impact query planning time and
>>>>> query execution cost (more IO) for Iceberg readers, which may be
>>>>> different from the engine that produced the Iceberg data set.
>>>>> 2/ It rests on the assumption that we need a spec version bump to add
>>>>> new fields, which I think is not completely true: we added things like
>>>>> the partition stats / statistics fields as optional. I don't understand
>>>>> why we can't do the same, especially for things like schema_id and
>>>>> footer_size that are mentioned as motivation. I think the community was
>>>>> pretty aligned on having schema_id as an optional field, which keeps
>>>>> writers backward compatible while all new writers get the benefit [1].
>>>>> 3/ One of the stated motivations is to support vendors' proprietary
>>>>> metadata for their proprietary clustering algorithms. To me this looks
>>>>> like a way to work around the spec and let the Iceberg metadata layout
>>>>> carry information that doesn't mean anything to the Iceberg ecosystem
>>>>> and can compromise interoperability.
>>>>> Also think of a case where Vendor A starts producing something in
>>>>> partnership with Vendor B and, to make things worse, encrypts it so that
>>>>> Vendor C, who is not in this partnership, cannot see it. IMHO we should
>>>>> not open up new ways to hurt interop.
>>>>>
>>>>> I also want to thank you for proposing the meeting. Unfortunately the
>>>>> proposed time doesn't work for me, as I have a conflicting meeting.
>>>>> Please feel free to proceed without me; I can watch the recording later.
>>>>> As far as my support is concerned, I look forward to answers that make a
>>>>> strong case for this use case and explain why we should be OK accepting
>>>>> these cons, given we already had a well-thought-out path forward.
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/4898
>>>>>
>>>>> Best,
>>>>> Prashant Singh
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I added/accepted on the dev calendar. Looking forward to it!
>>>>>>
>>>>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> It seems like we might not have full alignment on this proposal, so I
>>>>>>> tentatively scheduled a sync for next Monday (added to the Iceberg dev
>>>>>>> events calendar).  Please let me know if you are interested in joining
>>>>>>> and the time doesn't work for you (we can reschedule accordingly).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>>>>> > As an update I've made the proposal to add this field to the
>>>>>>> > Single file commits doc.
>>>>>>> >
>>>>>>> > Please let me know if there is any additional feedback.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Micah
>>>>>>> >
>>>>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <
>>>>>>> [email protected]>
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> > > Thanks Manu, that is the right doc.
>>>>>>> > >
>>>>>>> > > As an update, I've incorporated feedback from the community into
>>>>>>> > > the document.
>>>>>>> > >
>>>>>>> > > At a high level the changes are:
>>>>>>> > > - Renamed the field from "tags" to "attributes".
>>>>>>> > > - Clarified that limits on attributes should only be enforced for
>>>>>>> > > new data.  Existing tags must always be carried through.
>>>>>>> > > - Added more details on enforcing the size of tags.
>>>>>>> > >
>>>>>>> > > Are there any objections to folding the proposal into the V4
>>>>>>> > > metadata proposal?  Again, the reasons for doing so are mostly around
>>>>>>> > > ensuring consistent field numbering and making the spec update easier.
>>>>>>> > >
>>>>>>> > > If people want further discussion on this I'd be happy to discuss at
>>>>>>> > > the next V4 metadata sync or create a one-off meeting.  Please let me
>>>>>>> > > know.
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > > Micah
>>>>>>> > >
>>>>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <
>>>>>>> [email protected]> wrote:
>>>>>>> > >
>>>>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg
>>>>>>> > >> Single File Commits)?
>>>>>>> > >> I think you are referring to
>>>>>>> > >> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>> > >>
>>>>>>> > >> Best,
>>>>>>> > >> Manu
>>>>>>> > >>
>>>>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <
>>>>>>> [email protected]>
>>>>>>> > >> wrote:
>>>>>>> > >>
>>>>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>>>>> > >>> discussion has been happening on the doc [1]) in case it was
>>>>>>> > >>> missed over the holidays.
>>>>>>> > >>>
>>>>>>> > >>> Thanks,
>>>>>>> > >>> Micah
>>>>>>> > >>>
>>>>>>> > >>> [1] https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>>> > >>>
>>>>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <
>>>>>>> [email protected]>
>>>>>>> > >>> wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Sounds good, will wait until next year.
>>>>>>> > >>>>
>>>>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <
>>>>>>> [email protected]> wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we
>>>>>>> > >>>>> extend the feedback deadline to at least 1-2 weeks after the
>>>>>>> > >>>>> new year?
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <
>>>>>>> [email protected]>
>>>>>>> > >>>>> wrote:
>>>>>>> > >>>>>
>>>>>>> > >>>>>> > I have no problem with adding this discussion to the single
>>>>>>> > >>>>>> > file work, but I'm not sure that would speed it up? Seems like
>>>>>>> > >>>>>> > this is a pretty independent addition to the metadata layout?
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>>>>>> > >>>>>> consolidate in the doc is that it appears there is a bit of
>>>>>>> > >>>>>> metadata re-arrangement and new fields.  I wanted to make sure that:
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> 1.  We avoid field ID conflicts.
>>>>>>> > >>>>>> 2.  When writing up the final spec changes, it is easy to manage
>>>>>>> > >>>>>> and does not create a dependency one way or another between the
>>>>>>> > >>>>>> two of these.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a
>>>>>>> > >>>>>> separate piece of work.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> Cheers,
>>>>>>> > >>>>>> Micah
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>>>>> > >>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>
>>>>>>> > >>>>>>> I have no problem with adding this discussion to the single
>>>>>>> > >>>>>>> file work, but I'm not sure that would speed it up? Seems like
>>>>>>> > >>>>>>> this is a pretty independent addition to the metadata layout?
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>>>>>> > >>>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call
>>>>>>> > >>>>>>>>> out (and double-confirm) the key principle here: all tags must
>>>>>>> > >>>>>>>>> be strictly optional and never required for correctness or
>>>>>>> > >>>>>>>>> basic functionality. Engines should always be able to safely
>>>>>>> > >>>>>>>>> drop or ignore tags without breaking reads or writes, with the
>>>>>>> > >>>>>>>>> only possible impact being suboptimal behavior (e.g., extra
>>>>>>> > >>>>>>>>> I/O), as you described.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> 100%. I will also add this summary to the bottom of the
>>>>>>> > >>>>>>>> requirements section.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>>>>> > >>>>>>>> thereof), it does not seem like there are strong objections to
>>>>>>> > >>>>>>>> adding this for V4.  Prashant seemed to maybe have concerns, so
>>>>>>> > >>>>>>>> I'd like to understand if they are blockers.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> If there isn't additional feedback by the end of next week, I'd
>>>>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with the
>>>>>>> > >>>>>>>> single file improvement work, which has already reorganized the
>>>>>>> > >>>>>>>> metadata schema [1].  Please let me know if there is a different
>>>>>>> > >>>>>>>> process.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> Thanks,
>>>>>>> > >>>>>>>> Micah
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> [1] https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <
>>>>>>> [email protected]>
>>>>>>> > >>>>>>>> wrote:
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call
>>>>>>> > >>>>>>>>> out (and double-confirm) the key principle here: all tags must
>>>>>>> > >>>>>>>>> be strictly optional and never required for correctness or
>>>>>>> > >>>>>>>>> basic functionality. Engines should always be able to safely
>>>>>>> > >>>>>>>>> drop or ignore tags without breaking reads or writes, with the
>>>>>>> > >>>>>>>>> only possible impact being suboptimal behavior (e.g., extra
>>>>>>> > >>>>>>>>> I/O), as you described.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> Yufei
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>>> > >>>>>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>>> Hi Yufei,
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>> > >>>>>>>>>>> a clustering algorithm), would a data file rewrite (compaction)
>>>>>>> > >>>>>>>>>>> by another engine remove the tag and break the engine relying
>>>>>>> > >>>>>>>>>>> on it?
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> The intent here is that dropping tags should never break an
>>>>>>> > >>>>>>>>>> engine, but it could cause suboptimal operations.  For
>>>>>>> > >>>>>>>>>> instance, one example I brought up in the doc is using tags to
>>>>>>> > >>>>>>>>>> cache the Parquet footer size, to make sure the footer is
>>>>>>> > >>>>>>>>>> fetched in 1 I/O.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> In this case the following would occur:
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size
>>>>>>> > >>>>>>>>>> in tags.
>>>>>>> > >>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2
>>>>>>> > >>>>>>>>>> without tags.
>>>>>>> > >>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer
>>>>>>> > >>>>>>>>>> length is missing, so it falls back to reading a reasonable
>>>>>>> > >>>>>>>>>> number of bytes from the end of the Parquet file, hoping the
>>>>>>> > >>>>>>>>>> entire footer is retrieved (and if it isn't, a second I/O is
>>>>>>> > >>>>>>>>>> necessary).
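The fallback in step 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything taken from the proposal: the attribute key `footer_size`, the 64 KiB speculative read, and the `CountingReader` and `fake_parquet` helpers are all hypothetical. The only format fact relied on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8                 # 4-byte little-endian footer length + 4-byte magic
DEFAULT_TAIL_GUESS = 64 * 1024  # hypothetical speculative read size


class CountingReader:
    """Hypothetical stand-in for an object-store client; counts ranged reads."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def size(self):
        return len(self._data)

    def read_range(self, start, end):
        self.reads += 1
        return self._data[start:end]


def read_footer(reader, attributes=None, tail_guess=DEFAULT_TAIL_GUESS):
    """Return the raw footer bytes, using a cached size hint when one is present."""
    size = reader.size()
    hint = (attributes or {}).get("footer_size")  # hypothetical attribute key
    # With a hint, read exactly footer + trailer; without one, guess.
    want = int(hint) + TRAILER_LEN if hint is not None else tail_guess
    want = min(want, size)
    tail = reader.read_range(size - want, size)
    if len(tail) < TRAILER_LEN or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The real length always comes from the file, so a stale hint is harmless.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > size:
        raise ValueError("corrupt footer length")
    if needed > len(tail):
        # Hint missing (or too small) and the guess fell short: second I/O.
        tail = reader.read_range(size - needed, size)
    return tail[-needed:-TRAILER_LEN]


def fake_parquet(footer):
    """Build a byte string with a Parquet-style trailer (the body is filler)."""
    return MAGIC + b"\x00" * 100_000 + footer + struct.pack("<I", len(footer)) + MAGIC


if __name__ == "__main__":
    data = fake_parquet(b"f" * 200_000)
    with_hint = CountingReader(data)
    read_footer(with_hint, {"footer_size": "200000"})
    without_hint = CountingReader(data)
    read_footer(without_hint)
    print(with_hint.reads, without_hint.reads)
```

In this sketch the hint only changes how many bytes are requested first. A dropped or stale attribute costs at most one extra read and never changes the result, which matches the "suboptimal but never broken" intent described above.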
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>>>>> > >>>>>>>>>> be a sub-optimally clustered table, or perhaps redundant
>>>>>>> > >>>>>>>>>> clustering operations, but it shouldn't break anything. This is
>>>>>>> > >>>>>>>>>> no worse than the case today, though, if engine 1 and engine 2
>>>>>>> > >>>>>>>>>> have different clustering algorithms and they are being run in
>>>>>>> > >>>>>>>>>> interleaved fashion on the same table.  In this case it is
>>>>>>> > >>>>>>>>>> highly likely that some amount of duplicate compaction is
>>>>>>> > >>>>>>>>>> happening.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> In the current proposal, any metadata that is required for
>>>>>>> > >>>>>>>>>> proper functioning should never be put in tags.
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> Thanks,
>>>>>>> > >>>>>>>>>> Micah
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <
>>>>>>> [email protected]>
>>>>>>> > >>>>>>>>>> wrote:
>>>>>>> > >>>>>>>>>>
>>>>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like
>>>>>>> > >>>>>>>>>>> a clustering algorithm), would a data file rewrite (compaction)
>>>>>>> > >>>>>>>>>>> by another engine remove the tag and break the engine relying
>>>>>>> > >>>>>>>>>>> on it?
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> Yufei
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>>> > >>>>>>>>>>> [email protected]> wrote:
>>>>>>> > >>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>>>>> > >>>>>>>>>>>> in V4 metadata [2].  More details are in the document, but the
>>>>>>> > >>>>>>>>>>>> intent is to allow engines to store optional metadata
>>>>>>> > >>>>>>>>>>>> associated with these files:
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>>>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>>>>> > >>>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags cause
>>>>>>> > >>>>>>>>>>>> metadata bloat.
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> Thanks,
>>>>>>> > >>>>>>>>>>>> Micah
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>> > >>>>>>>>>>>> [2] https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>> > >>>>>>>>>>>>
>>>>>>> > >>>>>>>>>>>>
>>>>>>> >
>>>>>>>
>>>>>>
