Hi Daniel & Steve,

I appreciate the feedback. I believe my response to Russell's initial comment came across as weighted more heavily toward the security aspect than I intended. My mental model is of checksumming primarily as a durability, integrity, and correctness tool, with some debatable security benefits on top.
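To make that mental model concrete, here is a minimal sketch of the kind of integrity check I have in mind, using the JDK's built-in `java.util.zip.CRC32C` (available since Java 9). The input `"123456789"` is the standard CRC-32C (Castagnoli) test vector, so the expected check value is well known; everything else here is illustrative, not Iceberg code.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

public class Crc32cCheck {
    public static void main(String[] args) {
        // "123456789" is the standard CRC-32C test vector;
        // its well-known check value is 0xE3069283.
        CRC32C crc = new CRC32C();
        crc.update("123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.printf("CRC-32C = %08X%n", crc.getValue()); // prints "CRC-32C = E3069283"
    }
}
```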
I want to really +1 the CRC32C and opcode point Steve made; hardware support is a big reason why the performance cost of checksum calculation is reduced, though not eliminated, on modern systems. What I am taking away, hopefully not incorrectly, is that I should submit a formal proposal document to continue this discussion with more structure, even if the idea has a high bar to clear in proving its ubiquitous usefulness. I would also be remiss not to mention, early and often, that I am hoping for this to be an optional Table and/or IRC feature, not something defaulted/imposed on all customers, *IF* it is implemented.

On Thu, May 7, 2026 at 1:15 PM Daniel Weeks <[email protected]> wrote:

> I don't feel "security" is the right approach to justify adding checksums to file entries in metadata. This may just be the wording being used, but "integrity" is probably closer to what we're trying to communicate.
>
> However, this distinction is partially what keeps me from thinking introducing checksums is helpful. We currently track location and length, and implementations should always use unique paths and never overwrite existing paths. It is highly unlikely that a bit flip would manifest in a way that keeps the data consumable. The existing compressions, encodings, and validations make all of the random scenarios incredibly unlikely to be anything but transitory. I've seen many cases where data was corrupted at the hardware/engine layer, but never needed a checksum in the read path to identify that. The FileIO implementations perform checks on production of data, so the write path is reasonably covered.
>
> That leaves us with the "security" aspect, which implies some sort of malicious intent. However, if someone can craft a file that meets the length and location requirements, they could also update the metadata reference and checksum. This isn't a security feature and leads to more "security theater" than actual security.
>
> I don't think it adds value beyond the existing checks and validation performed at the FileIO layer. So while it seems like an improvement, it just adds unnecessary complexity.
>
> -Dan
>
> On Thu, May 7, 2026 at 11:50 AM Kurtis Wright <[email protected]> wrote:
>
>> Hi Russell,
>>
>> Thank you for the quick response. I think the security use case is a great example. I initially think of the security use case as relevant to the Bracketing concept in a Client to remote Server Side IRC setting: essentially validating that what the Client sent didn't get intercepted and changed over the wire. The durability and integrity checks are awesome because they give confidence that, whether the storage/network solution is a cloud provider or a self-hosted storage system (like Ceph or others), you have protections against bit rot, cosmic ray <https://en.wikipedia.org/wiki/Single-event_upset> caused bit flips, file corruption, network errors, and more.
>>
>> On Thu, May 7, 2026 at 11:23 AM Russell Spitzer <[email protected]> wrote:
>>
>>> The last time we discussed this was in conjunction with encryption. The consideration would be to add something like that as additional security against file tampering. Every entry would essentially have its key as well as additional bytes to confirm that the contents were as expected.
>>>
>>> On Thu, May 7, 2026 at 1:08 PM Kurtis Wright <[email protected]> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> Kurtis from S3Tables here. S3 utilizes checksums <https://en.wikipedia.org/wiki/Checksum> for durability and correctness <https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html>. I see that the S3 & GCS clients utilize checksumming, but after searching through both the Java implementation and the mailing list (just going back a few months) I couldn't find any reference to something in the spec.
>>>> I started writing a proposal for adding checksums for durability and correctness at a few different layers of Iceberg, but before I complete the proposal I wanted to check with the community to gauge interest in the concepts and hopefully get some initial feedback.
>>>>
>>>> The layers of Iceberg I am considering are:
>>>>
>>>> 1. At rest/storage in the file layer (metadata.json, manifest layer, data file layer)
>>>> 2. Bracketing in the Catalog
>>>> 3. Maybe during compaction operations (unsure exactly how this would work)
>>>>
>>>> Please let me know if there are considerations that were rejected or grew stale in the past. I would really appreciate reading more on what the community has already considered and learning from that. Otherwise, if you think this is cool and want to talk, or just want to +1, please reach out.
>>>>
>>>> --
>>>> Thank You,
>>>> Kurtis C. Wright
>>>>
>>>
>>
>> --
>> Thank You and Cheers,
>> Kurtis C. Wright
>>
>
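P.S. To ground what "optional checksum verification on the read path" could mean, here is a purely hypothetical sketch. The `matches` helper, its signature, and the idea of a manifest entry carrying a (length, CRC-32C) pair are all illustrative assumptions on my part, not an existing or proposed Iceberg API; the check value used is the standard CRC-32C test vector for `"123456789"`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32C;

public class EntryVerifier {
    // Hypothetical: compare a file on storage against the (length, CRC-32C)
    // pair an entry might optionally carry. The cheap length check runs
    // first, mirroring what implementations can already do today.
    static boolean matches(Path file, long expectedLength, long expectedCrc32c) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (bytes.length != expectedLength) {
            return false;
        }
        CRC32C crc = new CRC32C();
        crc.update(bytes);
        return crc.getValue() == expectedCrc32c;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("entry", ".bin");
        Files.write(tmp, "123456789".getBytes());
        // 0xE3069283 is the standard CRC-32C check value for "123456789"
        System.out.println(matches(tmp, 9, 0xE3069283L)); // prints "true"
        Files.delete(tmp);
    }
}
```

The point of the sketch is only that verification is a pure function of bytes already being read, which is why I see it as an opt-in durability/integrity tool rather than a security mechanism.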
