Hi Marco,

Thank you for responding; I appreciate the additional thoughts. As I have thought about this more, my view has shifted to focus on integrity as a holistic idea rather than just checksumming. I am curious how the dev list thinks about using XOR hashes to guard against logical bugs. An example would be catching a scenario where a delete is supposed to remove 2 rows but removes the wrong 2 rows. I will still recommend optional checksumming of data files for end-to-end data correctness validation, but I think additional logical integrity checks at runtime also have a strong place in the proposal.
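To make the XOR idea concrete, here is a minimal sketch in Python of the delete scenario above. Everything in it is an assumption for illustration only: the function names are hypothetical, rows are modeled as plain tuples, and the per-row hash is SHA-256 truncated to 64 bits (a faster non-cryptographic hash could stand in for it). It is not taken from the Iceberg spec or from any engine. The property it relies on is that XOR is order-independent and self-inverse, so the digest expected after a delete can be derived from the digest before the delete and the rows that were meant to be removed.

```python
import hashlib


def row_hash(row: tuple) -> int:
    """Hash one row to a 64-bit integer (hypothetical row encoding via repr)."""
    digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def xor_digest(rows) -> int:
    """Order-independent digest: XOR of all per-row hashes."""
    acc = 0
    for row in rows:
        acc ^= row_hash(row)
    return acc


def delete_is_consistent(rows_before, rows_to_delete, rows_after) -> bool:
    """Check that rows_after equals rows_before minus rows_to_delete.

    XOR-ing a row hash a second time cancels it out, so the expected
    digest is the digest before the delete XOR the digest of the rows
    that were supposed to be removed. The row count is checked too,
    because duplicate rows cancel each other in a pure XOR digest.
    """
    expected_digest = xor_digest(rows_before) ^ xor_digest(rows_to_delete)
    expected_count = len(rows_before) - len(rows_to_delete)
    return (
        xor_digest(rows_after) == expected_digest
        and len(rows_after) == expected_count
    )


table = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
intended_delete = [(1, "a"), (2, "b")]

correct_result = [(3, "c"), (4, "d")]  # rows 1 and 2 removed, as intended
buggy_result = [(1, "a"), (4, "d")]    # rows 2 and 3 removed by mistake

correct_ok = delete_is_consistent(table, intended_delete, correct_result)
buggy_ok = delete_is_consistent(table, intended_delete, buggy_result)
```

In this sketch `correct_ok` is true and `buggy_ok` is false even though both results have the right row count, which is the kind of logical bug a file-level checksum alone would not surface. An engine would presumably keep the digest incrementally rather than rescanning all rows.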

On Mon, May 18, 2026 at 2:00 AM Marco Kroll <[email protected]> wrote:

> Hi Kurtis, all,
>
> Allow me to add my 2 cents.
>
> I like the idea to (optionally) add something like HMAC to the metadata
> files to improve security. This would allow the detection of tampering by
> unauthorized engines that happen to have access to the storage but not the
> key, provided key management is handled outside of the storage boundaries.
>
> Regarding integrity I second Andrei's point, that ensuring the integrity
> of the data in the format and engine are complementary.
> To check the integrity of the format the current 'workaround' is to use
> the checksums provided by the external storage, which is heavily
> dependent on the storage provider.
>
> But verification throughout operations, I think, should be something that
> is implemented by the respective engines and should not be part of the spec
> per se.
> That being said, the spec should provide the building blocks for this
> functionality.
> A proposal should perhaps introduce 2 (complementary?) categories:
>
> 1. Security (e.g., HMAC-SHA256)
> 2. Integrity (e.g. CRC32 or xxHash <https://github.com/cyan4973/xxhash> [1],
>    focusing on performance, rather than collision resistance)
>
> One small side note on integrity. Bit flips can happen due to a wide range
> of issues, although faulty memory leads the pack, faulty network hardware,
> OS drivers and even bad CPUs can also corrupt data silently.
>
>
> Best
> Marco
>
> [1]: https://github.com/cyan4973/xxhash
>
>
> On Fri, May 8, 2026 at 12:43 PM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> Hi Kurtis, all,
>>
>> Just a couple cents from the databricks / delta side.
>>
>> We run an independent end-to-end integrity check on our DML / merge
>> compute path in production. We do see a small but non-zero rate of
>> mismatches at fleet scale that survive all the codec, length, and FileIO
>> validations; they only surface because we have an independent invariant to
>> compare against. ECC, TLS, and codec checks do a lot, but they aren't
>> end-to-end across caches, interconnects, and heterogeneous fleets.
>>
>> So +1 on the proposal, especially with the opt-in framing in your last
>> message; that's the right shape. Operators who have observed integrity
>> issues (or who run on storage / network paths where the cloud-store
>> guarantees don't fully apply) get to turn it on; everyone else pays
>> nothing. Format-layer checksums and engine-layer checks are complementary,
>> not redundant.
>>
>> + @Marco Kroll <[email protected]> here, who can go on
>> empirical side deeper if this interests you
>>
>> Best,
>> Andrei
>>
>> On Thu, May 7, 2026 at 10:57 PM Kurtis Wright <[email protected]>
>> wrote:
>>
>>> Hi Daniel & Steve,
>>>
>>> I appreciate the feedback. I believe my response to Russell's initial
>>> comment portrayed my stance as indexing more than I intended into the
>>> security aspect. I am coming from the mental model of checksumming as
>>> a durability, integrity, and correctness tool primarily with some debatable
>>> potential security added benefits.
>>>
>>> I want to really +1 the CRC32C and opcode point Steve made as it is a
>>> big reason why checksum calculation performance hit is lessened on
>>> modern systems, though not eliminated.
>>>
>>> What I am taking away, hopefully not incorrectly, is that I
>>> should submit a formal proposal document to continue this discussion with
>>> more structure even if this idea has a high bar to clear in terms of
>>> proving its ubiquitous usefulness.
>>>
>>> I would also be remiss to not mention early and often, I am hoping for
>>> this to be an optional Table and/or IRC feature not something
>>> defaulted/imposed on all customers *IF* implemented.
>>>
>>> On Thu, May 7, 2026 at 1:15 PM Daniel Weeks <[email protected]>
>>> wrote:
>>>
>>>> I don't feel "security" is the right approach to justify adding
>>>> checksums to file entries in metadata. This may just be the wording being
>>>> used, but "integrity" is probably closer to what we're trying to
>>>> communicate.
>>>>
>>>> However, this distinction is partially what keeps me from thinking
>>>> introducing checksums is helpful. We currently track location and length
>>>> and implementations should always use unique paths and never overwrite
>>>> existing paths. It is highly unlikely that a bit flip would manifest in a
>>>> way that keeps the data consumable. The existing compressions, encodings,
>>>> and validations make all of the random scenarios incredibly unlikely to be
>>>> anything but transitory. I've seen many cases where data was corrupted at
>>>> the hardware/engine layer, but never needed a checksum in the read path to
>>>> identify that. The FileIO implementations perform checks on production of
>>>> data, so the write path is reasonably covered.
>>>>
>>>> That leaves us with the "security" aspect, which implies some sort of
>>>> malicious intent. However, if someone can craft a file that meets the
>>>> length and location requirements, they could also update the metadata
>>>> reference and checksum. This isn't a security feature and leads to more
>>>> "security theater" than actual security.
>>>>
>>>> I don't think it adds value beyond the existing checks and validation
>>>> performed at the FileIO layer. So while it seems like an improvement, it
>>>> just adds unnecessary complexity.
>>>>
>>>> -Dan
>>>>
>>>> On Thu, May 7, 2026 at 11:50 AM Kurtis Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Russell,
>>>>>
>>>>> Thank you for the quick response. I think the security use case is a
>>>>> great example. I initially think of the security use case as relevant to
>>>>> the Bracketing concept in a Client to remote Server Side IRC setting.
>>>>> Essentially validating that what the Client sent didn't get intercepted
>>>>> and changed over the wire. The durability and integrity checks are awesome
>>>>> because it can give confidence that no matter if the storage/network
>>>>> solution is a cloud provider or a self-hosted storage system (like CEPH or
>>>>> others) you have protections against bit rot, cosmic ray
>>>>> <https://en.wikipedia.org/wiki/Single-event_upset> caused bit flip,
>>>>> file corruption, network errors, and more.
>>>>>
>>>>> On Thu, May 7, 2026 at 11:23 AM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> The last time we discussed this was in conjunction with encryption.
>>>>>> The consideration would be to add something like that as additional
>>>>>> security against file tampering. Every entry would essentially have its
>>>>>> key as well as additional bytes to confirm that the contents were as
>>>>>> expected.
>>>>>>
>>>>>> On Thu, May 7, 2026 at 1:08 PM Kurtis Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Kurtis from S3Tables, S3 utilizes checksums
>>>>>>> <https://en.wikipedia.org/wiki/Checksum> for durability and
>>>>>>> correctness
>>>>>>> <https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html>.
>>>>>>> I see that S3 & GCS clients utilize checksumming, but after searching
>>>>>>> through both the Java implementation and the mail list (just going back
>>>>>>> a few months) I couldn't find any reference to something in the spec. I
>>>>>>> started writing a proposal for adding checksums for durability and
>>>>>>> correctness at a few different layers of Iceberg, but before I complete
>>>>>>> a proposal I wanted to check with the community to gauge interest in the
>>>>>>> concepts and hopefully have some initial feedback.
>>>>>>>
>>>>>>> The layers of Iceberg I am considering are:
>>>>>>>
>>>>>>> 1. At rest/storage in the file layer (metadata.json, manifest
>>>>>>>    layer, data file layer)
>>>>>>> 2. Bracketing in the Catalog
>>>>>>> 3. Maybe during compaction operations (unsure exactly how this
>>>>>>>    would work)
>>>>>>>
>>>>>>> Please let me know if there were considerations that we denied or
>>>>>>> grew stale in the past. I would really appreciate reading more on what
>>>>>>> the community has considered already and learn from that. Otherwise if you
>>>>>>> think this is cool and want to talk or just plus one please reach out.
>>>>>>>
>>>>>>> --
>>>>>>> Thank You,
>>>>>>> Kurtis C. Wright
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Thank You and Cheers,
>>>>> Kurtis C. Wright
>>>>>
>>>>
