Hi Daniel & Steve,

  I appreciate the feedback. I believe my response to Russell's initial
comment made my stance read as more focused on the security aspect than I
intended. My mental model is checksumming primarily as a durability,
integrity, and correctness tool, with some debatable potential security
benefits on top.

I want to strongly +1 the CRC32C and opcode point Steve made, as hardware
support is a big reason the performance cost of checksum calculation is
reduced on modern systems, though not eliminated.
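
As a concrete illustration, a minimal sketch of the calculation involved
(assuming Java 9+, where java.util.zip.CRC32C was added; the class name is
mine, not from the Iceberg codebase). On modern CPUs the JVM intrinsifies
this to dedicated instructions such as the SSE4.2 crc32 opcode:

    import java.util.zip.CRC32C;

    public class ChecksumSketch {
        // Compute a CRC32C checksum over a byte buffer. The JVM typically
        // lowers this to hardware crc32 instructions, which reduces but
        // does not eliminate the per-byte cost of checksumming.
        static long crc32c(byte[] data) {
            CRC32C crc = new CRC32C();
            crc.update(data, 0, data.length);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] payload = "example manifest bytes".getBytes();
            System.out.printf("crc32c=%08x%n", crc32c(payload));
        }
    }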

What I am taking away, hopefully not incorrectly, is that I should submit a
formal proposal document to continue this discussion with more structure,
even though this idea has a high bar to clear in proving its broad
usefulness.

I would also be remiss not to mention, early and often, that I am hoping for
this to be an optional table and/or IRC feature, not something defaulted or
imposed on all customers, *IF* implemented.
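
To make the opt-in shape concrete, a purely hypothetical sketch against the
existing Java Table API; the property name "write.checksum.enabled" is
invented here for illustration and is not part of any spec or implementation:

    // Hypothetical opt-in per table; the property name is invented for
    // illustration and does not exist in Iceberg today.
    table.updateProperties()
        .set("write.checksum.enabled", "true")
        .commit();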

On Thu, May 7, 2026 at 1:15 PM Daniel Weeks <[email protected]>
wrote:

> I don't feel "security" is the right approach to justify adding checksums
> to file entries in metadata.  This may just be the wording being used, but
> "integrity" is probably closer to what we're trying to communicate.
>
> However, this distinction is partially what keeps me from thinking
> introducing checksums is helpful.  We currently track location and length
> and implementations should always use unique paths and never overwrite
> existing paths.  It is highly unlikely that a bit flip would manifest in a
> way that keeps the data consumable. The existing compressions, encodings,
> and validations make all of the random scenarios incredibly unlikely to be
> anything but transitory.  I've seen many cases where data was corrupted at
> the hardware/engine layer, but never needed a checksum in the read path to
> identify that.  The FileIO implementations perform checks on production of
> data, so the write path is reasonably covered.
>
> That leaves us with the "security" aspect, which implies some sort of
> malicious intent.  However, if someone can craft a file that meets the
> length and location requirements, they could also update the metadata
> reference and checksum.  This isn't a security feature and leads to more
> "security theater" than actual security.
>
> I don't think it adds value beyond the existing checks and validation
> performed at the FileIO layer.  So while it seems like an improvement, it
> just adds unnecessary complexity.
>
> -Dan
>
> On Thu, May 7, 2026 at 11:50 AM Kurtis Wright <[email protected]>
> wrote:
>
>> Hi Russell,
>>
>> Thank you for the quick response. I think the security use case is a
>> great example. I initially think of it as relevant to the bracketing
>> concept in a client to remote server-side IRC setting: essentially
>> validating that what the client sent was not intercepted and changed over
>> the wire. The durability and integrity checks are valuable because they
>> give confidence that, whether the storage/network solution is a cloud
>> provider or a self-hosted storage system (like Ceph or others), you have
>> protections against bit rot, cosmic ray
>> <https://en.wikipedia.org/wiki/Single-event_upset> caused bit flips, file
>> corruption, network errors, and more.
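>>
>> A rough sketch of that bracketing round trip (the method names are
>> hypothetical, just to illustrate the idea): the client computes a CRC32C
>> over the payload it sends, and the server recomputes it on receipt and
>> rejects on mismatch.
>>
>>     import java.util.zip.CRC32C;
>>
>>     // Hypothetical bracketing check: the client sends (payload, checksum);
>>     // the server recomputes and compares before trusting the payload.
>>     final class Bracketing {
>>         static long checksum(byte[] payload) {
>>             CRC32C crc = new CRC32C();
>>             crc.update(payload, 0, payload.length);
>>             return crc.getValue();
>>         }
>>
>>         static void validateOnReceipt(byte[] payload, long claimed) {
>>             if (checksum(payload) != claimed) {
>>                 throw new IllegalStateException("payload changed in transit");
>>             }
>>         }
>>     }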
>>
>> On Thu, May 7, 2026 at 11:23 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> The last time we discussed this was in conjunction with encryption. The
>>> consideration would be to add something like that as additional security
>>> against file tampering. Every entry would essentially have its key as well
>>> as additional bytes to confirm that the contents were as expected.
>>>
>>> On Thu, May 7, 2026 at 1:08 PM Kurtis Wright <[email protected]>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>>   Kurtis from S3Tables here. S3 utilizes checksums
>>>> <https://en.wikipedia.org/wiki/Checksum> for durability and correctness
>>>> <https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html>.
>>>> I see that the S3 & GCS clients utilize checksumming, but after searching
>>>> through both the Java implementation and the mailing list (going back a
>>>> few months) I couldn't find any reference to checksums in the spec. I
>>>> started writing a proposal for adding checksums for durability and
>>>> correctness at a few different layers of Iceberg, but before completing
>>>> it I wanted to check with the community to gauge interest in the concepts
>>>> and hopefully gather some initial feedback.
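>>>>
>>>> For reference, this is roughly how the AWS SDK for Java v2 exposes that
>>>> on upload (a minimal sketch; the bucket and key are placeholders), asking
>>>> S3 to compute and store a CRC32C for the object:
>>>>
>>>>     import software.amazon.awssdk.core.sync.RequestBody;
>>>>     import software.amazon.awssdk.services.s3.S3Client;
>>>>     import software.amazon.awssdk.services.s3.model.ChecksumAlgorithm;
>>>>     import software.amazon.awssdk.services.s3.model.PutObjectRequest;
>>>>
>>>>     // Minimal sketch: request CRC32C checksum handling on upload.
>>>>     // "my-bucket" and "my-key" are placeholders.
>>>>     try (S3Client s3 = S3Client.create()) {
>>>>         s3.putObject(
>>>>             PutObjectRequest.builder()
>>>>                 .bucket("my-bucket")
>>>>                 .key("my-key")
>>>>                 .checksumAlgorithm(ChecksumAlgorithm.CRC32_C)
>>>>                 .build(),
>>>>             RequestBody.fromString("example data"));
>>>>     }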
>>>>
>>>> The layers of Iceberg I am considering are:
>>>>
>>>>    1. At rest/storage in the file layer (metadata.json, manifest
>>>>    layer, data file layer)
>>>>    2. Bracketing in the Catalog
>>>>    3. Maybe during compaction operations (unsure exactly how this
>>>>    would work)
>>>>
>>>> Please let me know if there were considerations that were declined or
>>>> grew stale in the past. I would really appreciate reading more about what
>>>> the community has already considered and learning from that. Otherwise,
>>>> if you think this is cool and want to talk, or just want to +1, please
>>>> reach out.
>>>>
>>>> --
>>>> Thank You,
>>>> Kurtis C. Wright
>>>>
>>>
>>
>> --
>> Thank You and Cheers,
>> Kurtis C. Wright
>>
>
