Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Fri, 19 Dec 2025 08:47:22 -0800

> I have no problem with adding this discussion to the single file work,
> but I'm not sure that would speed it up? Seems like this is a pretty
> independent addition to the metadata layout?


Yes, it is fairly independent.  The main reason I wanted to consolidate in
the doc is that there appears to be a bit of metadata rearrangement and new
fields.  I wanted to make sure that:

1.  We avoid field ID conflicts.
2.  Writing up the final spec changes is easy to manage and does not
create a dependency one way or another between the two.

Happy to keep the implementation of the guard-rails as a separate piece of
work.

Cheers,
Micah

On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <[email protected]>
wrote:

> I have no problem with adding this discussion to the single file work, but
> I'm not sure that would speed it up? Seems like this is a pretty
> independent addition to the metadata layout?
>
> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]>
> wrote:
>
>>> Thanks for the clarification, Micah! I want to explicitly call out (and
>>> double-confirm) the key principle here: all tags must be strictly optional
>>> and never required for correctness or basic functionality. Engines should
>>> always be able to safely drop or ignore tags without breaking reads or
>>> writes, with the only possible impact being suboptimal behavior (e.g.,
>>> extra I/O), as you described.
>>
>>
>> 100%.  I will also add this summary to the bottom of the requirements
>> section.
>>
>> Based on mailing list discussion and doc comments (or lack thereof), it
>> does not seem like there are strong objections to adding this for V4.
>> Prashant seemed to maybe have concerns, so I'd like to understand if they
>> are blockers.
>>
>> If there isn't additional feedback by the end of next week, I'd like to
>> assume a lazy consensus and consolidate this with the single file
>> improvement work, which has already reorganized the metadata schema [1].
>> Please let me know if there is a different process.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>
>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]> wrote:
>>
>>> Thanks for the clarification, Micah! I want to explicitly call out (and
>>> double-confirm) the key principle here: all tags must be strictly optional
>>> and never required for correctness or basic functionality. Engines should
>>> always be able to safely drop or ignore tags without breaking reads or
>>> writes, with the only possible impact being suboptimal behavior (e.g.,
>>> extra I/O), as you described.
>>>
>>> As long as this constraint is clearly stated and enforced, the trade-off
>>> feels reasonable to me.
>>>
>>> Yufei
>>>
>>>
>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Hi Yufei,
>>>>
>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>> clustering algorithm), would a data file rewrite (compaction) by another
>>>>> engine remove the tag and break the engine relying on it?
>>>>
>>>>
>>>> The intent here is that dropping tags should never break an engine,
>>>> but it could cause suboptimal operations.  For instance, one example I
>>>> brought up in the doc is using tags to cache the Parquet footer size, to
>>>> make sure the footer is fetched in 1 I/O.
>>>>
>>>> In this case the following would occur.
>>>>
>>>> 1.  Engine 1 does a write to file 1 and records its footer size in tags.
>>>> 2.  Engine 2 does a rewrite/compaction and produces file 2 without
>>>> tags.
>>>> 3.  Engine 1 then tries to read file 2.  The tag for footer length is
>>>> missing, so it falls back to reading a reasonable number of bytes from
>>>> the end of the Parquet file, hoping the entire footer is retrieved (and
>>>> if it isn't, a second I/O is necessary).
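
For illustration, the footer-size fallback in the three steps above might
look roughly like the Python sketch below.  It is only a sketch under stated
assumptions, not part of the proposal: the tag key
`example.parquet-footer-size`, the 64 KiB speculative read size, the
`CountingFile` stand-in for object storage, and the choice to store the
footer metadata length (excluding the 8-byte trailer) in the tag are all
hypothetical.  The only format fact it relies on is that a Parquet file ends
with a 4-byte little-endian footer length followed by the magic bytes PAR1.

```python
import struct
from typing import Mapping, Optional

PARQUET_MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic
FOOTER_SIZE_TAG = "example.parquet-footer-size"  # hypothetical tag key
DEFAULT_TAIL_GUESS = 64 * 1024  # hypothetical speculative read size


class CountingFile:
    """In-memory stand-in for an object-store file; counts range reads."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.reads = 0

    def read_range(self, start: int, length: int) -> bytes:
        self.reads += 1
        return self.data[start:start + length]


def make_fake_parquet(footer: bytes, body: bytes = b"\x00" * 100) -> bytes:
    """Build bytes with a Parquet-style trailer (the footer is opaque here)."""
    return (PARQUET_MAGIC + body + footer
            + struct.pack("<I", len(footer)) + PARQUET_MAGIC)


def read_footer(f: CountingFile, file_size: int,
                tags: Optional[Mapping[str, str]] = None,
                tail_guess: int = DEFAULT_TAIL_GUESS) -> bytes:
    """Return the raw footer bytes, treating the tag purely as a hint."""
    hint = None
    if tags is not None:
        try:
            hint = int(tags[FOOTER_SIZE_TAG])
        except (KeyError, TypeError, ValueError):
            hint = None  # missing or malformed tag: fall back to guessing
    if hint is not None and hint < 0:
        hint = None

    # First I/O: exact size if the tag is present, otherwise a guess.
    want = hint + TRAILER_LEN if hint is not None else tail_guess
    want = min(want, file_size)
    tail = f.read_range(file_size - want, want)
    if len(tail) < TRAILER_LEN or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("missing Parquet magic at end of file")

    # The length stored in the file itself is authoritative, not the tag.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("footer length exceeds file size")
    if needed > len(tail):
        # Second I/O: the guess (or a stale tag) was too small.
        tail = f.read_range(file_size - needed, needed)
    return tail[-needed:-TRAILER_LEN]


if __name__ == "__main__":
    footer = b"m" * 1000
    data = make_fake_parquet(footer)
    tagged = CountingFile(data)
    assert read_footer(tagged, len(data), {FOOTER_SIZE_TAG: "1000"}) == footer
    untagged = CountingFile(data)
    assert read_footer(untagged, len(data), None, tail_guess=64) == footer
    print(tagged.reads, untagged.reads)
```

With a valid tag the footer comes back in one range read; with the tag
dropped (or stale) the reader still returns the same bytes, at the cost of a
possible second read.  The tag is never trusted for correctness, which
matches the intent that dropping tags should only ever cost extra I/O.
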
>>>>
>>>> Similarly for clustering algorithms, I think the result could be a
>>>> sub-optimally clustered table, or perhaps redundant clustering operations,
>>>> but it shouldn't break anything. This is no worse than the case today,
>>>> though, if engine 1 and engine 2 have different clustering algorithms and
>>>> they are run in interleaved fashion on the same table.  In that case it is
>>>> highly likely that some amount of duplicate compaction is happening.
>>>>
>>>> In the current proposal, any metadata that is required for proper
>>>> functioning should never be put in tags.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]> wrote:
>>>>
>>>>> Thanks for the proposal!
>>>>>
>>>>> If one engine started to rely on a tag for certain reasons (like a
>>>>> clustering algorithm), would a data file rewrite (compaction) by another
>>>>> engine remove the tag and break the engine relying on it?
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Iceberg Dev,
>>>>>> I added a proposal [1] to add a key-value tags field for files in V4
>>>>>> metadata [2].  More details are in the document but the intent is to
>>>>>> allow engines to store optional metadata associated with these files:
>>>>>>
>>>>>> 1.  The proposed field is optional and cannot be used for metadata
>>>>>> required for reading the table correctly.
>>>>>> 2.  It also proposes guard-rails for not letting tags cause metadata
>>>>>> bloat.
>>>>>>
>>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>> [2]
>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>>
>>>>>>
