Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Wed, 21 Jan 2026 17:16:40 -0800

Thanks Manu, that is the right doc.

As an update, I've incorporated feedback from the community into the document:


At a high level the changes are:
- Renamed the field from "tags" to "attributes"
- Clarified that limits on attributes should only be enforced for new data.
Existing attributes must always be carried through.
- Added more details on enforcing the size of attributes.
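
One way to read the second bullet, as a minimal Python sketch. The limit
values and the helper names (`attrs_size`, `merge_attributes`) are
hypothetical; this thread does not state concrete limits or an API, so
treat this as an illustration of the carry-through rule only.

```python
# Hypothetical limits; the thread does not give concrete values.
MAX_NEW_ATTRIBUTES = 8
MAX_NEW_BYTES = 256


def attrs_size(attrs):
    """Total size in bytes of all keys (UTF-8 encoded) plus all values."""
    return sum(len(k.encode("utf-8")) + len(v) for k, v in attrs.items())


def merge_attributes(existing, new):
    """Carry existing attributes through untouched; check limits on new ones only."""
    # Only entries that are actually added or changed count against the limits.
    added = {k: v for k, v in new.items() if existing.get(k) != v}
    if len(added) > MAX_NEW_ATTRIBUTES or attrs_size(added) > MAX_NEW_BYTES:
        raise ValueError("new attributes exceed the configured limits")
    merged = dict(existing)  # existing entries are never dropped or truncated
    merged.update(added)
    return merged
```

With this shape, an oversized attribute that is already present survives a
rewrite, while a writer trying to add an oversized one is rejected.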

Are there any objections to folding the proposal into the V4 metadata
proposal?  Again, the reasons for doing so are mostly around ensuring
consistent field numbering and making the spec update easier.

If people want further discussion on this I'd be happy to discuss at the
next V4 metadata sync or create a one-off meeting.  Please let me know.

Thanks,
Micah

On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <[email protected]> wrote:

> Happy new year Micah. Are you linking the wrong doc (Iceberg Single File
> Commits)?
> I think you are referring to
> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>
> Best,
> Manu
>
> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <[email protected]>
> wrote:
>
>> Happy new year everyone, I just wanted to bump this thread (most
>> discussion has been happening on the doc [1]) in case it was missed over
>> the holidays.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>
>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Sounds good, will wait until next year.
>>>
>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Micah, many people will be OOO in the next two weeks. Can we extend the
>>>> feedback deadline to at least 1-2 weeks after the new year?
>>>>
>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> > I have no problem with adding this discussion to the single file
>>>>> > work, but I'm not sure that would speed it up? Seems like this is a
>>>>> > pretty independent addition to the metadata layout?
>>>>>
>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>>>> consolidate in the doc is that there appears to be a bit of metadata
>>>>> rearrangement and new fields.  I wanted to make sure that:
>>>>>
>>>>> 1.  We avoid field ID conflicts.
>>>>> 2.  The final spec changes are easy to manage when written up, and
>>>>> neither proposal creates a dependency on the other.
>>>>>
>>>>> Happy to keep the implementation of the guard-rails as a separate
>>>>> piece of work.
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I have no problem with adding this discussion to the single file
>>>>>> work, but I'm not sure that would speed it up? Seems like this is a
>>>>>> pretty independent addition to the metadata layout?
>>>>>>
>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>>>> breaking reads or writes, with the only possible impact being
>>>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>>>
>>>>>>>
>>>>>>> 100%.  I will also add this summary to the bottom of the requirements
>>>>>>> section.
>>>>>>>
>>>>>>> Based on mailing list discussion and doc comments (or lack thereof),
>>>>>>> it does not seem like there are strong objections to adding this for V4.
>>>>>>> Prashant seemed to have some concerns, so I'd like to understand whether
>>>>>>> they are blockers.
>>>>>>>
>>>>>>> If there isn't additional feedback by the end of next week, I'd like
>>>>>>> to assume a lazy consensus and consolidate this with the single file
>>>>>>> improvement work, which has already reorganized the metadata schema [1].
>>>>>>> Please let me know if there is a different process.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>>>
>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the clarification, Micah! I want to explicitly call out
>>>>>>>> (and double-confirm) the key principle here: all tags must be strictly
>>>>>>>> optional and never required for correctness or basic functionality.
>>>>>>>> Engines should always be able to safely drop or ignore tags without
>>>>>>>> breaking reads or writes, with the only possible impact being
>>>>>>>> suboptimal behavior (e.g., extra I/O), as you described.
>>>>>>>>
>>>>>>>> As long as this constraint is clearly stated and enforced, the
>>>>>>>> trade-off feels reasonable to me.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Yufei,
>>>>>>>>>
>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by
>>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The intent here is that dropping tags should never break an
>>>>>>>>> engine, but it could cause suboptimal operations.  For instance, one
>>>>>>>>> example I brought up in the doc is using tags to cache the Parquet
>>>>>>>>> footer size, to make sure the footer is fetched in one I/O.
>>>>>>>>>
>>>>>>>>> In this case the following would occur.
>>>>>>>>>
>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its footer size in
>>>>>>>>> tags.
>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2
>>>>>>>>> without tags.
>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer length
>>>>>>>>> is missing, so it falls back to reading a reasonable number of bytes
>>>>>>>>> from the end of the Parquet file, hoping the entire footer is
>>>>>>>>> retrieved (and if it isn't, a second I/O is necessary).
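
The three steps above can be sketched roughly as follows. This is a minimal
Python illustration, not code from the proposal: the attribute key
`footer-size`, the 64 KiB speculative read, and the function name
`read_footer` are all assumptions. The only format fact relied on is that a
Parquet file ends with a 4-byte little-endian footer length followed by the
magic bytes `PAR1`.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_LEN = 8               # 4-byte little-endian footer length + 4-byte magic
DEFAULT_GUESS = 64 * 1024  # hypothetical speculative read size


def read_footer(f, file_size, attributes):
    """Return (footer_bytes, number_of_reads) for a Parquet-style file."""
    hint = attributes.get("footer-size")  # hypothetical key; may be absent
    want = int(hint) + TAIL_LEN if hint is not None else DEFAULT_GUESS
    want = min(want, file_size)

    # First I/O: exact size when the hint exists, a guess otherwise.
    f.seek(file_size - want)
    tail = f.read(want)
    reads = 1

    if len(tail) < TAIL_LEN or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TAIL_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")

    if needed > len(tail):
        # The guess (or a stale hint) was too small: second I/O.
        f.seek(file_size - needed)
        tail = f.read(needed)
        reads += 1

    return tail[-needed:-TAIL_LEN], reads
```

The footer length is always taken from the file itself, so a missing or
stale attribute only costs an extra read and never changes the result.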
>>>>>>>>>
>>>>>>>>> Similarly for clustering algorithms, I think the result could
>>>>>>>>> be a sub-optimally clustered table, or perhaps redundant clustering
>>>>>>>>> operations, but it shouldn't break anything. This is no worse than
>>>>>>>>> the case today, though, if engine 1 and engine 2 have different
>>>>>>>>> clustering algorithms and they are run in interleaved fashion on
>>>>>>>>> the same table.  In that case it is highly likely that some amount
>>>>>>>>> of duplicate compaction is happening.
>>>>>>>>>
>>>>>>>>> In the current proposal, any metadata that is required for proper
>>>>>>>>> functioning should never be put in tags.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>
>>>>>>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>>>>>>> clustering algorithm), would a data file rewrite (compaction) by
>>>>>>>>>> another engine remove the tag and break the engine relying on it?
>>>>>>>>>>
>>>>>>>>>> Yufei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Iceberg Dev,
>>>>>>>>>>> I added a proposal [1] to add a key-value tags field for files
>>>>>>>>>>> in V4 metadata [2].  More details are in the document, but the
>>>>>>>>>>> intent is to allow engines to store optional metadata associated
>>>>>>>>>>> with these files:
>>>>>>>>>>>
>>>>>>>>>>> 1.  The proposed field is optional and cannot be used for
>>>>>>>>>>> metadata required for reading the table correctly.
>>>>>>>>>>> 2.  It also proposes guard-rails to keep tags from causing
>>>>>>>>>>> metadata bloat.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Micah
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>>>>>>> [2]
>>>>>>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>>>>>>
>>>>>>>>>>>
