Hi Xander,

+1, I agree that this needs a stricter specification. Thank you for taking
this up. I'll also take a look at your PR
(https://github.com/apache/iceberg/pull/16527) shortly.

Cheers,
Adam


On Thu, 21 May 2026 at 02:25, Yufei Gu <[email protected]> wrote:

> Hi Xander,
>
> Thanks for digging into this and documenting the current behavior so
> clearly.
>
> +1 on putting these formats into the spec. At least from an
> interoperability perspective, the current situation creates a practical gap
> between "spec compliant" and "cross-implementation compatible."
>
> Yufei
>
>
> On Wed, May 20, 2026 at 3:14 PM Alexander Bailey <[email protected]>
> wrote:
>
>> Hi all,
>>
>> While implementing table encryption in iceberg-rust, we found a couple
>> of undocumented formats that are required for interoperability but are
>> described in the spec only as "implementation-specific." We
>> have reverse-engineered these from Java's implementation to achieve
>> byte-compatibility. Any future implementation (PyIceberg, etc.) would need
>> to do the same.
>>
>> I'd like to propose that we specify the following in the spec, likely as
>> a new appendix or an expansion of the encryption section.
>>
>> 1. StandardKeyMetadata — the file-level key metadata format
>>
>> The `key_metadata` binary field (field 131 in data files, field 519 in
>> manifest lists) uses a versioned Avro encoding in Java's
>> `StandardKeyMetadata`:
>>
>> Wire format: `[version: 1 byte (0x01)] [Avro binary datum]`
>>
>> V1 schema:
>> ```
>> required(0, "encryption_key", binary) -- plaintext DEK
>> optional(1, "aad_prefix", binary) -- per-file AAD prefix for AES-GCM
>> optional(2, "file_length", long) -- encrypted file size (for streaming decryption)
>> ```
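
To make the byte layout above concrete, here is a minimal Java sketch of the
V1 wire format, written by hand rather than with the Avro library. It assumes
each optional field is encoded as an Avro union with the null branch at index
0 and the value branch at index 1. That branch order, and the names
`KeyMetadataSketch`, `encode`, and `decode`, are illustrative assumptions and
are not taken from Java's `StandardKeyMetadata`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical helper: hand-rolled encoder/decoder for the V1 layout
// [version: 1 byte (0x01)] [Avro binary datum].
final class KeyMetadataSketch {
  static final byte V1 = 0x01;

  static final class Decoded {
    final byte[] key;
    final byte[] aadPrefix;  // null when absent
    final Long fileLength;   // null when absent

    Decoded(byte[] key, byte[] aadPrefix, Long fileLength) {
      this.key = key;
      this.aadPrefix = aadPrefix;
      this.fileLength = fileLength;
    }
  }

  // Avro long: zigzag-encode, then emit 7-bit groups, least significant first.
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80));
      z >>>= 7;
    }
    out.write((int) z);
  }

  static long readLong(ByteBuffer in) {
    long z = 0;
    int shift = 0;
    int b;
    do {
      b = in.get() & 0xFF;
      z |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (z >>> 1) ^ -(z & 1);
  }

  // Avro bytes: length as an Avro long, followed by the raw bytes.
  static void writeBytes(ByteArrayOutputStream out, byte[] v) {
    writeLong(out, v.length);
    out.write(v, 0, v.length);
  }

  static byte[] readBytes(ByteBuffer in) {
    byte[] v = new byte[(int) readLong(in)];
    in.get(v);
    return v;
  }

  // Assumption: optional fields are unions [null, T]; index 0 means absent.
  static byte[] encode(byte[] key, byte[] aadPrefix, Long fileLength) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(V1);
    writeBytes(out, key);
    if (aadPrefix == null) {
      writeLong(out, 0);
    } else {
      writeLong(out, 1);
      writeBytes(out, aadPrefix);
    }
    if (fileLength == null) {
      writeLong(out, 0);
    } else {
      writeLong(out, 1);
      writeLong(out, fileLength);
    }
    return out.toByteArray();
  }

  static Decoded decode(byte[] wire) {
    ByteBuffer in = ByteBuffer.wrap(wire);
    if (in.get() != V1) {
      throw new IllegalArgumentException("unsupported key metadata version");
    }
    byte[] key = readBytes(in);
    byte[] aad = readLong(in) == 0 ? null : readBytes(in);
    Long len = readLong(in) == 0 ? null : Long.valueOf(readLong(in));
    return new Decoded(key, aad, len);
  }
}
```

Under these assumptions, a 16-byte key with both optional fields absent
encodes to 20 bytes: the version byte, a one-byte length, the key, and two
zero bytes for the null union branches.
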
>>
>> 2. The encryption-keys list — KEKs vs wrapped DEKs
>>
>> The table-level `encryption-keys` list stores two kinds of entries,
>> distinguished by what `encrypted-by-id` points to:
>>
>> **KEK entries** (`encrypted-by-id` = table master key ID):
>> - `encrypted-key-metadata`: the KEK wrapped by the KMS (opaque,
>> KMS-specific format)
>> - `properties`: includes `"key-timestamp"` (epoch millis) for expiration
>>
>> **Wrapped manifest-list DEK entries** (`encrypted-by-id` = a KEK's
>> key-id):
>> - `encrypted-key-metadata`: the `StandardKeyMetadata` Avro bytes (from #1
>> above) encrypted with AES-GCM using the referenced KEK, with the KEK's
>> timestamp string as AAD
>> - `properties`: empty
>>
>> The convention for distinguishing these two types of entries, and the
>> wrapping scheme (AES-GCM with the KEK timestamp as AAD to prevent
>> tampering), are not documented anywhere in the spec from what I can see.
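
The wrapping step described above can be sketched in Java with the standard
`javax.crypto` API. The thread only says AES-GCM is used with the KEK's
timestamp string as AAD, so the rest is assumed for illustration: a random
12-byte nonce prepended to the output, a 128-bit tag appended by the cipher,
and the timestamp encoded as UTF-8. The class and method names are
hypothetical, and the byte layout Java actually uses should be checked
against its implementation before relying on this.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: wraps key metadata bytes under a KEK with AES-GCM,
// binding the KEK's timestamp string as additional authenticated data.
final class DekWrapSketch {
  static final int NONCE_LEN = 12;  // assumed nonce size in bytes
  static final int TAG_BITS = 128;  // assumed GCM tag size in bits

  // Returns nonce || ciphertext || tag (layout is an assumption).
  static byte[] wrap(byte[] kek, String kekTimestamp, byte[] keyMetadata) {
    try {
      byte[] nonce = new byte[NONCE_LEN];
      new SecureRandom().nextBytes(nonce);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"),
          new GCMParameterSpec(TAG_BITS, nonce));
      cipher.updateAAD(kekTimestamp.getBytes(StandardCharsets.UTF_8));
      byte[] sealed = cipher.doFinal(keyMetadata);  // ciphertext plus tag
      byte[] out = Arrays.copyOf(nonce, NONCE_LEN + sealed.length);
      System.arraycopy(sealed, 0, out, NONCE_LEN, sealed.length);
      return out;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("wrap failed", e);
    }
  }

  // Fails if the ciphertext, the tag, or the timestamp AAD does not match.
  static byte[] unwrap(byte[] kek, String kekTimestamp, byte[] wrapped) {
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(kek, "AES"),
          new GCMParameterSpec(TAG_BITS, wrapped, 0, NONCE_LEN));
      cipher.updateAAD(kekTimestamp.getBytes(StandardCharsets.UTF_8));
      return cipher.doFinal(wrapped, NONCE_LEN, wrapped.length - NONCE_LEN);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("unwrap failed", e);
    }
  }
}
```

Because the timestamp is authenticated rather than encrypted, unwrapping with
a different timestamp string fails the tag check, which is the tamper
protection the AAD is meant to provide.
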
>>
>> 3. What can stay "implementation-specific"
>>
>> The KEK's `encrypted-key-metadata` is intentionally opaque; it's whatever
>> the KMS returns from `wrapKey`. That's fine to leave unspecified since it's
>> between the implementation and its KMS provider.
>>
>> ### Why this matters
>>
>> Without specifying #1 and #2, "implementation-specific" becomes a
>> practical interop barrier: tables encrypted by one implementation would be
>> unreadable by another despite both being spec-compliant. These formats are
>> already versioned and frozen in Java - the spec would just be documenting
>> existing reality.
>>
>> Would there be interest in a PR for this? Happy to draft it.
>>
>> Thanks,
>> Xander
>>
>
