Is this a problem in memory or on disk? I would expect schemas like this to
compress fairly well. Or maybe the issue is sending them to clients? I just
always prefer solutions that are simpler, so in approximate order: 1) don't
keep so many, 2) use generic compression, 3) don't send them if you don't
need to, and 4) change the representation. I just want to make sure we
aren't jumping to 4 when a simpler solution would work.
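
For what it's worth, here is a quick synthetic way to sanity-check option 2
(the data below is made-up, schema-shaped JSON, not real Iceberg metadata):

    # Generate 25 near-identical historical schemas for a 1000-column
    # table and compare raw vs. gzip-compressed JSON size.
    import gzip, json

    fields = [{"id": i, "name": f"col_{i}", "type": "string"}
              for i in range(1000)]
    schemas = [{"schema-id": v,
                "fields": fields + [{"id": 1000 + v,
                                     "name": f"extra_{v}",
                                     "type": "string"}]}
               for v in range(25)]
    raw = json.dumps({"schemas": schemas}).encode("utf-8")
    print(len(raw), len(gzip.compress(raw)))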

On Thu, Feb 12, 2026 at 10:45 AM Russell Spitzer <[email protected]>
wrote:

> For very wide tables, I think this becomes a problem with a single-digit
> number of schema changes. My theoretical example here is a table
> with 1000 columns that we add new columns to every hour or so. Unless I
> want to keep my history locked to 24 hours (or less), schema bloat is gonna
> be a pretty big issue.
>
> On Thu, Feb 12, 2026 at 10:37 AM Ryan Blue <[email protected]> wrote:
>
>> For tables where this is a problem, how are you currently managing older
>> schemas? Older schemas do not need to be kept if there aren't any snapshots
>> that reference them.
>>
>> On Thu, Feb 12, 2026 at 10:24 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> My gut instinct on this is that it's a great idea. I think we probably
>>> need to think a bit more about how to decide on "base" schema promotion but
>>> theoretically this seems like it should be a huge benefit for wide tables.
>>>
>>> On Thu, Feb 12, 2026 at 7:55 AM Talat Uyarer via dev <
>>> [email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am sharing a new proposal for Iceberg Spec v4: *Delta-Encoded
>>>> Schemas*. We propose moving away from monolithic schema storage to
>>>> address a growing scalability bottleneck in high-velocity and ultra-wide
>>>> table environments.
>>>>
>>>> The current Iceberg Spec re-serializes and appends the entire schema
>>>> object to metadata.json for every schema operation, which leads to
>>>> massive replication of schema data. For a large table with 5,000+ columns
>>>> and frequent schema updates, this can result in metadata files reaching
>>>> multiple GBs, causing significant query planning latency and driver-side OOMs.
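>>>>
>>>> As a rough, illustrative back-of-envelope (an assumption, not a
>>>> measurement): if each column serializes to roughly 200 bytes of JSON, a
>>>> 5,000-column schema is about 1 MB, so a few hundred to a thousand
>>>> retained copies of it already put schema data at the GB scale.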
>>>>
>>>> *Proposal Summary:*
>>>>
>>>> We propose implementing *Delta-Encoded Schema Evolution for Spec v4* using
>>>> a *"Merge-on-Read" (MoR) approach for metadata*. This transitions the
>>>> schemas field from a list of full schema snapshots to a sequence of
>>>> *Base Schemas* (type full) and *Schema Deltas* (type delta) that store
>>>> differential mutations relative to a base schema ID.
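>>>>
>>>> To make the resolution step concrete, here is a rough sketch; the field
>>>> names and delta layout below are illustrative only, not the exact JSON
>>>> in the proposal doc:
>>>>
>>>> def resolve_schema(entries, schema_id):
>>>>     """Merge-on-read: resolve a schema entry to its full field list,
>>>>     applying delta ops on top of its base (flat, no delta chaining)."""
>>>>     by_id = {e["schema-id"]: e for e in entries}
>>>>     entry = by_id[schema_id]
>>>>     if entry["type"] == "full":
>>>>         return list(entry["fields"])
>>>>     base = by_id[entry["base-schema-id"]]
>>>>     assert base["type"] == "full"  # flat model: no chaining
>>>>     fields = {f["id"]: f for f in base["fields"]}
>>>>     for op in entry["ops"]:
>>>>         if op["action"] in ("add", "update"):
>>>>             fields[op["field"]["id"]] = op["field"]
>>>>         elif op["action"] == "delete":
>>>>             fields.pop(op["field-id"], None)
>>>>     return list(fields.values())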
>>>>
>>>> *Key Goals:*
>>>>
>>>>    - Achieve a *99.4% reduction in the size of schema-related metadata*.
>>>>    - Drastically lower the storage and IO requirements for
>>>>    metadata.json.
>>>>    - Accelerate query planning by reducing the JSON payload size.
>>>>    - Preserve self-containment by keeping the schema in the metadata
>>>>    file, avoiding external sidecar files.
>>>>
>>>> The full proposal, including the flat resolution model (no delta
>>>> chaining), the defined set of atomic delta operations (add, update,
>>>> delete), and the lifecycle/compaction mechanics, is available for
>>>> review:
>>>>
>>>> https://s.apache.org/iceberg-delta-schemas
>>>>
>>>> I look forward to your feedback and discussion on the dev list.
>>>>
>>>> Thanks
>>>> Talat
>>>>
>>>
