I have no problem with adding this discussion to the single file work, but I'm not sure that would speed it up? Seems like this is a pretty independent addition to the metadata layout?

On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <[email protected]> wrote:

>> Thanks for the clarification, Micah! I want to explicitly call out (and
>> double-confirm) the key principle here: all tags must be strictly optional
>> and never required for correctness or basic functionality. Engines should
>> always be able to safely drop or ignore tags without breaking reads or
>> writes, with the only possible impact being suboptimal behavior (e.g.,
>> extra I/O), as you described.
>
> 100%. I will also add this summary to the bottom of the requirements
> section.
>
> Based on mailing list discussion and doc comments (or lack thereof), it
> does not seem like there are strong objections to adding this for V4.
> Prashant seemed to maybe have concerns, so I'd like to understand if they
> are blockers.
>
> If there isn't additional feedback by the end of next week, I'd like to
> assume a lazy consensus and consolidate this with the single file
> improvement work, which has already reorganized the metadata schema [1].
> Please let me know if there is a different process.
>
> Thanks,
> Micah
>
> [1]
> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>
> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <[email protected]> wrote:
>
>> Thanks for the clarification, Micah! I want to explicitly call out (and
>> double-confirm) the key principle here: all tags must be strictly optional
>> and never required for correctness or basic functionality. Engines should
>> always be able to safely drop or ignore tags without breaking reads or
>> writes, with the only possible impact being suboptimal behavior (e.g.,
>> extra I/O), as you described.
>>
>> As long as this constraint is clearly stated and enforced, the trade-off
>> feels reasonable to me.
>>
>> Yufei
>>
>>
>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Hi Yufei,
>>>
>>>> If one engine started to rely on a tag for certain reasons (like
>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>> engine remove the tag and break the engine relying on it?
>>>
>>>
>>> The intent here is that dropping tags should never break an engine. But
>>> it could cause suboptimal operations. For instance, one example I brought
>>> up in the docs is using tags to cache Parquet footer size, to make sure it
>>> is fetched in 1 I/O.
>>>
>>> In this case the following would occur.
>>>
>>> 1. Engine 1 does a write to file 1 and records its footer size in tags.
>>> 2. Engine 2 does a rewrite/compaction and produces file 2 without tags.
>>> 3. Engine 1 then tries to read file 2. The tag for footer length is
>>> missing, so it falls back to reading a reasonable number of bytes from the
>>> end of the Parquet file, hoping the entire footer is retrieved (and if it
>>> isn't, a second I/O is necessary).
>>>
>>> Similarly for clustering algorithms, I think the result could yield a
>>> sub-optimally clustered table, or perhaps redundant clustering operations,
>>> but shouldn't break anything. This is no worse than the case today, though,
>>> if engine 1 and engine 2 have different clustering algorithms and they are
>>> being run in interleaved fashion on the same table. In this case it is
>>> highly likely that some amount of duplicate compaction is happening.
>>>
>>> In the current proposal, any metadata that is required for proper
>>> functioning should never be put in tags.
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]> wrote:
>>>
>>>> Thanks for the proposal!
>>>>
>>>> If one engine started to rely on a tag for certain reasons (like
>>>> clustering algorithm), would data file rewrite (compaction) by another
>>>> engine remove the tag and break the engine relying on it?
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Iceberg Dev,
>>>>> I added a proposal [1] to add a key-value tags field for files in V4
>>>>> metadata [2]. More details are in the document, but the intent is to
>>>>> allow engines to store optional metadata associated with these files:
>>>>>
>>>>> 1. The proposed field is optional and cannot be used for metadata
>>>>> required for reading the table correctly.
>>>>> 2. It also proposes guard-rails for not letting tags cause metadata
>>>>> bloat.
>>>>>
>>>>> Looking forward to hearing everyone's thoughts and feedback.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>> [2]
>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
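
For anyone skimming the thread, the footer-size example from the Dec 15 message (steps 1 to 3) could look roughly like the sketch below. This is only an illustration under stated assumptions, not anything from the proposal or from the Iceberg codebase: the tag key "footer-size", the CountingFile helper, and read_footer are all hypothetical names, and the file is simulated in memory. The one thing taken as given is the Parquet file tail layout (footer metadata, then a 4-byte little-endian footer length, then the "PAR1" magic). The point is that the tag is only a hint: a missing, unparsable, or stale tag costs at most a second read and never changes the result.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic
FOOTER_SIZE_TAG = "footer-size"  # hypothetical tag key, not from the proposal


class CountingFile:
    """In-memory stand-in for an object-store file; counts range reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_tail(self, n: int) -> bytes:
        # One call models one I/O against the end of the file.
        self.reads += 1
        n = min(n, len(self.data))
        return self.data[len(self.data) - n:]


def read_footer(f, tags=None, guess=64 * 1024):
    """Return the raw footer bytes, treating the tag as an optional hint."""
    hint = (tags or {}).get(FOOTER_SIZE_TAG)
    try:
        want = int(hint) + TAIL_LEN if hint is not None else guess
    except (TypeError, ValueError):
        want = guess  # unparsable tag: ignore it and fall back to the guess
    tail = f.read_tail(max(want, TAIL_LEN))
    if len(tail) < TAIL_LEN or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file (bad magic)")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TAIL_LEN
    if needed > len(tail):
        # Tag missing or stale and the guess was too small: second I/O.
        tail = f.read_tail(needed)
        if len(tail) < needed:
            raise ValueError("file shorter than its declared footer")
    return tail[len(tail) - needed:len(tail) - TAIL_LEN]


# Simulated file: some data pages, a 5000-byte footer, then the 8-byte tail.
footer = b"m" * 5000
blob = b"x" * 100 + footer + struct.pack("<I", len(footer)) + MAGIC

with_tag = CountingFile(blob)      # step 1: tag written by engine 1
no_tag = CountingFile(blob)        # step 2: tag dropped by engine 2
stale_tag = CountingFile(blob)     # a wrong hint must still be harmless

footer_with_tag = read_footer(with_tag, {FOOTER_SIZE_TAG: "5000"})
footer_no_tag = read_footer(no_tag, None, guess=1024)
footer_stale = read_footer(stale_tag, {FOOTER_SIZE_TAG: "10"})
```

With the tag present the footer comes back in one read; without it (or with a stale value) the reader pays for a second read but returns the same bytes, which matches the "suboptimal but never broken" requirement discussed above.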
