Hi All,

I am sharing a new proposal for Iceberg Spec v4: *Delta-Encoded Schemas*. We propose moving away from monolithic schema storage to address a growing scalability bottleneck in high-velocity and ultra-wide table environments.

The current Iceberg Spec re-serializes the entire schema object and appends it to metadata.json on every schema operation, which leads to massive replication of schema data. For a large table with 5,000+ columns and frequent schema updates, metadata files can grow to multiple gigabytes, causing significant query-planning latency and out-of-memory (OOM) errors on the driver side.

*Proposal Summary:*

We propose implementing *Delta-Encoded Schema Evolution for Spec v4* using a *"Merge-on-Read" (MoR) approach for metadata*. The schemas field would transition from a list of "Full Snapshots" to a sequence of *Base Schemas* (type full) and *Schema Deltas* (type delta), where each delta stores differential mutations relative to a base ID.

*Key Goals:*

- Achieve a *99.4% reduction in the size of schema-related metadata*.
- Drastically lower the storage and IO requirements for metadata.json.
- Accelerate query planning by reducing the JSON payload size.
- Preserve self-containment by keeping the schema in the metadata file, avoiding external sidecar files.

The full proposal, including the flat resolution model (no delta chaining), the defined set of atomic delta operations (add, update, delete), and the lifecycle/compaction mechanics, is available for review:
https://s.apache.org/iceberg-delta-schemas

I look forward to your feedback and discussion on the dev list.

Thanks,
Talat
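P.S. For readers who want a concrete picture before opening the document, here is a minimal Python sketch of how a reader might resolve a delta-encoded schema. It is an illustration under stated assumptions, not the layout defined in the proposal: the JSON keys (`type`, `base-schema-id`, `ops`, `field-id`), the example fields, and the helper `resolve_schema` are all hypothetical. It also assumes that "flat resolution" means each delta lists every mutation relative to a full base schema, so resolving any schema takes at most one merge step.

```python
import copy

# Hypothetical "schemas" list: one full base schema plus two deltas.
# All key names here are illustrative assumptions, not the proposed spec.
schemas = [
    {"schema-id": 0, "type": "full", "fields": [
        {"id": 1, "name": "id", "type": "long", "required": True},
        {"id": 2, "name": "ts", "type": "timestamp", "required": False},
    ]},
    {"schema-id": 1, "type": "delta", "base-schema-id": 0, "ops": [
        {"op": "add", "field": {"id": 3, "name": "region", "type": "string", "required": False}},
        {"op": "update", "field": {"id": 2, "name": "event_ts", "type": "timestamp", "required": False}},
    ]},
    # Assumed flat model: schema 2 restates "add region" because it is
    # relative to base 0, not chained onto delta 1.
    {"schema-id": 2, "type": "delta", "base-schema-id": 0, "ops": [
        {"op": "add", "field": {"id": 3, "name": "region", "type": "string", "required": False}},
        {"op": "delete", "field-id": 2},
    ]},
]


def resolve_schema(schemas, schema_id):
    """Return the full field list for schema_id, merging a delta onto its base."""
    by_id = {s["schema-id"]: s for s in schemas}
    entry = by_id[schema_id]
    if entry["type"] == "full":
        return copy.deepcopy(entry["fields"])

    base = by_id[entry["base-schema-id"]]
    if base["type"] != "full":
        # No delta chaining: a delta may only point at a full schema.
        raise ValueError("delta must reference a full base schema")

    # Key fields by ID; dicts keep insertion order, so column order is stable.
    fields = {f["id"]: copy.deepcopy(f) for f in base["fields"]}
    for op in entry["ops"]:
        kind = op["op"]
        if kind == "add":
            field = op["field"]
            if field["id"] in fields:
                raise ValueError(f"field {field['id']} already exists")
            fields[field["id"]] = copy.deepcopy(field)
        elif kind == "update":
            field = op["field"]
            if field["id"] not in fields:
                raise ValueError(f"field {field['id']} not found")
            fields[field["id"]] = copy.deepcopy(field)
        elif kind == "delete":
            if op["field-id"] not in fields:
                raise ValueError(f"field {op['field-id']} not found")
            del fields[op["field-id"]]
        else:
            raise ValueError(f"unknown delta op: {kind}")
    return list(fields.values())


if __name__ == "__main__":
    for sid in (0, 1, 2):
        print(sid, [f["name"] for f in resolve_schema(schemas, sid)])
```

In this sketch, a wide table would store its thousands of columns once in the base entry, and each later schema version would carry only its handful of changes. The trade-off under this assumption is that deltas grow as they drift from the base, which is presumably where the compaction mechanics in the proposal come in.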
