Re: [Curiosity] Checksums in Iceberg Libraries

Kurtis Wright Fri, 22 May 2026 11:16:02 -0700

Hi Marco,

I appreciate the additional thoughts. Thank you for responding. As I have
thought about this more, my view has shifted to focus more on integrity as a
holistic idea rather than just checksumming. I am curious how the dev list
thinks about using XOR hashes to guard against logical bugs. An example
would be catching a scenario where a delete is supposed to remove 2 rows but
removes the wrong 2 rows. I will still recommend optional checksumming
of data files for end-to-end data correctness validation, but I think
additional logical integrity checks at runtime also have a strong place in
the proposal.
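
To make the XOR idea concrete, here is a minimal sketch in Python. It is
not taken from any Iceberg library: row_hash, xor_fingerprint, and the
repr-based row encoding are hypothetical names and choices made only for
illustration. Because XOR is its own inverse, the fingerprint expected after
a delete can be derived from the old fingerprint and the rows that were
supposed to go, and a delete that removes the right number of rows but the
wrong ones no longer matches.

```python
import hashlib


def row_hash(row: tuple) -> int:
    """Hash one row to a 64-bit integer (hypothetical repr-based encoding)."""
    digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def xor_fingerprint(rows) -> int:
    """Order-independent fingerprint: XOR of every row's hash."""
    fingerprint = 0
    for row in rows:
        fingerprint ^= row_hash(row)
    return fingerprint


before = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
to_delete = [(2, "b"), (3, "c")]

# XORing the deleted rows' hashes out of the old fingerprint yields the
# fingerprint the table should have once the delete has been applied.
expected = xor_fingerprint(before) ^ xor_fingerprint(to_delete)

correct_result = [(1, "a"), (4, "d")]
buggy_result = [(1, "a"), (2, "b")]  # right row count, wrong rows removed

print(xor_fingerprint(correct_result) == expected)
print(xor_fingerprint(buggy_result) == expected)
```

One caveat with a plain XOR: two identical rows cancel each other out, so
tables that allow duplicate rows would need a row count or a different
combiner alongside it.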


On Mon, May 18, 2026 at 2:00 AM Marco Kroll <[email protected]> wrote:

> Hi Kurtis, all,
>
> Allow me to add my 2 cents.
>
> I like the idea to (optionally) add something like HMAC to the metadata
> files to improve security. This would allow the detection of tampering by
> unauthorized engines that happen to have access to the storage but not the
> key, provided key management is handled outside of the storage boundaries.
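
As an illustration of the HMAC point above, a minimal sketch using only the
Python standard library. The function names, the example key, and the example
metadata bytes are assumptions made for this sketch; nothing here is part of
the Iceberg spec or of any existing Iceberg API.

```python
import hashlib
import hmac


def sign_metadata(key: bytes, metadata_bytes: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the raw metadata file bytes."""
    return hmac.new(key, metadata_bytes, hashlib.sha256).hexdigest()


def verify_metadata(key: bytes, metadata_bytes: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_metadata(key, metadata_bytes), tag)


# Hypothetical key; in practice it would live outside the storage boundary.
key = b"example-key-managed-outside-storage"
metadata = b'{"format-version": 2, "current-snapshot-id": 1}'
tag = sign_metadata(key, metadata)

# A writer with storage access but without the key cannot produce a valid tag
# for modified content.
tampered = b'{"format-version": 2, "current-snapshot-id": 9}'

print(verify_metadata(key, metadata, tag))
print(verify_metadata(key, tampered, tag))
```

The detection only holds as long as the key stays out of reach of whoever can
write to the storage location.
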
>
> Regarding integrity, I second Andrei's point that integrity checks in the
> format and in the engine are complementary.
> To check the integrity of the format the current 'workaround' is to use
> the checksums provided by the external storage, which is heavily
> dependent on the storage provider.
>
> But verification throughout operations, I think, should be something that
> is implemented by the respective engines and should not be part of the spec
> per se.
> That being said, the spec should provide the building blocks for this
> functionality.
> A proposal should perhaps introduce 2 (complementary?) categories:
>
>    1. Security (e.g., HMAC-SHA256)
>    2. Integrity (e.g., CRC32 or xxHash
>    <https://github.com/cyan4973/xxhash> [1], focusing on performance
>    rather than collision resistance)
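
For the integrity category, a minimal sketch of a streamed CRC32 check using
Python's zlib module. crc32_stream and the sample payload are hypothetical,
chosen only for illustration. xxHash is not shown because it needs a
third-party package.

```python
import zlib


def crc32_stream(chunks) -> int:
    """Compute CRC32 over an iterable of byte chunks without joining them."""
    running = 0
    for chunk in chunks:
        running = zlib.crc32(chunk, running)
    return running


payload = bytearray(b"example data file contents")
recorded = crc32_stream([bytes(payload)])  # checksum stored at write time

payload[3] ^= 0x01  # simulate a single bit flip somewhere after the write

recomputed = crc32_stream([bytes(payload)])
print(recomputed == recorded)
```

CRC32 catches accidental corruption such as this bit flip, but anyone who can
rewrite the file can also recompute the checksum, which is why it belongs in
the integrity category rather than the security one.
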
>
> One small side note on integrity. Bit flips can happen due to a wide range
> of issues. Although faulty memory leads the pack, faulty network hardware,
> OS drivers, and even bad CPUs can also corrupt data silently.
>
>
> Best
> Marco
>
> [1]: https://github.com/cyan4973/xxhash
>
>
> On Fri, May 8, 2026 at 12:43 PM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> Hi Kurtis, all,
>>
>> Just a couple cents from the Databricks / Delta side.
>>
>> We run an independent end-to-end integrity check on our DML / merge
>> compute path in production. We do see a small but non-zero rate of
>> mismatches at fleet scale that survive all the codec, length, and FileIO
>> validations—they only surface because we have an independent invariant to
>> compare against. ECC, TLS, and codec checks do a lot, but they aren't
>> end-to-end across caches, interconnects, and heterogeneous fleets.
>>
>> So +1 on the proposal, especially with the opt-in framing in your last
>> message—that's the right shape. Operators who have observed integrity
>> issues (or who run on storage / network paths where the cloud-store
>> guarantees don't fully apply) get to turn it on; everyone else pays
>> nothing. Format-layer checksums and engine-layer checks are complementary,
>> not redundant.
>>
>> + @Marco Kroll <[email protected]> here, who can go deeper on
>> the empirical side if this interests you
>>
>> Best,
>> Andrei
>>
>> On Thu, May 7, 2026 at 10:57 PM Kurtis Wright <[email protected]>
>> wrote:
>>
>>> Hi Daniel & Steve,
>>>
>>>   I appreciate the feedback. I believe my response to Russell's initial
>>> comment portrayed my stance as indexing more than I intended into the
>>> security aspect. I am coming from the mental model of checksumming as
>>> primarily a durability, integrity, and correctness tool, with some
>>> debatable potential added security benefits.
>>>
>>> I want to really +1 the CRC32C and opcode point Steve made as it is a
>>> big reason why checksum calculation performance hit is lessened on
>>> modern systems, though not eliminated.
>>>
>>> What I am taking away, hopefully not incorrectly, is that I
>>> should submit a formal proposal document to continue this discussion with
>>> more structure even if this idea has a high bar to clear in terms of
>>> proving its ubiquitous usefulness.
>>>
>>> I would also be remiss to not mention early and often, I am hoping for
>>> this to be an optional Table and/or IRC feature not something
>>> defaulted/imposed on all customers *IF* implemented.
>>>
>>> On Thu, May 7, 2026 at 1:15 PM Daniel Weeks <[email protected]>
>>> wrote:
>>>
>>>> I don't feel "security" is the right approach to justify adding
>>>> checksums to file entries in metadata.  This may just be the wording being
>>>> used, but "integrity" is probably closer to what we're trying to
>>>> communicate.
>>>>
>>>> However, this distinction is partially what keeps me from thinking
>>>> introducing checksums is helpful.  We currently track location and length
>>>> and implementations should always use unique paths and never overwrite
>>>> existing paths.  It is highly unlikely that a bit flip would manifest in a
>>>> way that keeps the data consumable. The existing compressions, encodings,
>>>> and validations make all of the random scenarios incredibly unlikely to be
>>>> anything but transitory.  I've seen many cases where data was corrupted at
>>>> the hardware/engine layer, but never needed a checksum in the read path to
>>>> identify that.  The FileIO implementations perform checks on production of
>>>> data, so the write path is reasonably covered.
>>>>
>>>> That leaves us with the "security" aspect, which implies some sort of
>>>> malicious intent.  However, if someone can craft a file that meets the
>>>> length and location requirements, they could also update the metadata
>>>> reference and checksum.  This isn't a security feature and leads to more
>>>> "security theater" than actual security.
>>>>
>>>> I don't think it adds value beyond the existing checks and validation
>>>> performed at the FileIO layer.  So while it seems like an improvement, it
>>>> just adds unnecessary complexity.
>>>>
>>>> -Dan
>>>>
>>>> On Thu, May 7, 2026 at 11:50 AM Kurtis Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Russell,
>>>>>
>>>>> Thank you for the quick response. I think the security use case is a
>>>>> great example. I initially think of the security use case as relevant to
>>>>> the Bracketing concept in a Client to remote Server Side IRC setting.
>>>>> Essentially validating that what the Client sent didn't get intercepted
>>>>> and changed over the wire. The durability and integrity checks are
>>>>> awesome because they can give confidence that, whether the
>>>>> storage/network solution is a cloud provider or a self-hosted storage
>>>>> system (like Ceph or others), you have protections against bit rot,
>>>>> cosmic ray <https://en.wikipedia.org/wiki/Single-event_upset> caused bit
>>>>> flips, file corruption, network errors, and more.
>>>>>
>>>>> On Thu, May 7, 2026 at 11:23 AM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> The last time we discussed this was in conjunction with encryption.
>>>>>> The consideration would be to add something like that as additional
>>>>>> security against file tampering. Every entry would essentially have its
>>>>>> key as well as additional bytes to confirm that the contents were as
>>>>>> expected.
>>>>>>
>>>>>> On Thu, May 7, 2026 at 1:08 PM Kurtis Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>>   Kurtis from S3Tables here. S3 utilizes checksums
>>>>>>> <https://en.wikipedia.org/wiki/Checksum> for durability and
>>>>>>> correctness
>>>>>>> <https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html>.
>>>>>>> I see that S3 & GCS clients utilize checksumming, but after searching
>>>>>>> through both the Java implementation and the mailing list (just going
>>>>>>> back a few months) I couldn't find any reference to something in the
>>>>>>> spec. I started writing a proposal for adding checksums for durability
>>>>>>> and correctness at a few different layers of Iceberg, but before I
>>>>>>> complete a proposal I wanted to check with the community to gauge
>>>>>>> interest in the concepts and hopefully get some initial feedback.
>>>>>>>
>>>>>>> The layers of Iceberg I am considering are:
>>>>>>>
>>>>>>>    1. At rest/storage in the file layer (metadata.json, manifest
>>>>>>>    layer, data file layer)
>>>>>>>    2. Bracketing in the Catalog
>>>>>>>    3. Maybe during compaction operations (unsure exactly how this
>>>>>>>    would work)
>>>>>>>
>>>>>>> Please let me know if there were considerations that were rejected or
>>>>>>> grew stale in the past. I would really appreciate reading more on what
>>>>>>> the community has considered already and learning from that. Otherwise,
>>>>>>> if you think this is cool and want to talk or just plus one, please
>>>>>>> reach out.
>>>>>>>
>>>>>>> --
>>>>>>> Thank You,
>>>>>>> Kurtis C. Wright
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Thank You and Cheers,
>>>>> Kurtis C. Wright
>>>>>
>>>>
