On Thu, 7 May 2026 at 19:51, Kurtis Wright <[email protected]> wrote:
> Hi Russell,
>
> Thank you for the quick response. I think the security use case is a great
> example. I initially think of the security use case as relevant to the
> Bracketing concept in a Client to remote Server Side IRC setting.
> Essentially validating that what the Client sent didn't get intercepted and
> changed over the wire. The durability and integrity checks are awesome
> because they can give confidence that, whether the storage/network
> solution is a cloud provider or a self-hosted storage system (like CEPH or
> others), you have protections against bit rot, cosmic-ray-caused bit flips
> <https://en.wikipedia.org/wiki/Single-event_upset>, file corruption,
> network errors, and more.

They perform their own checksumming in the background if they care (HDFS).
What checksums in metadata do is detect deliberate tampering, or any
accidental overwrites. For the latter, knowing a CRC32C checksum is enough,
and as Java 11+ uses the CPU's native opcodes if it has them, it would be
fast to validate when an audit process reads the entire file to check.

HDFS puts its checksums in smaller blocks, which is why there's so many
.crc files alongside the data if you use it to write to the local fs; spare
cluster time validates unused data and recovers from other copies or RAID6
algorithms if that's the storage structure. I assume S3 and the cloud
stores are similar. Adding checksum generation during uploads to S3
measurably hurt performance in the past; not tested recently. It is needed
if you want to store data in some specific class of AWS S3 bucket.

Regarding wire tampering: TLS already handles that, doesn't it? At least
between the TLS stack and the S3 load balancers. In memory, people should
be using ECC DIMMs, if they can afford it ;).

Steve

> On Thu, May 7, 2026 at 11:23 AM Russell Spitzer <[email protected]>
> wrote:
>
>> The last time we discussed this was in conjunction with encryption. The
>> consideration would be to add something like that as additional security
>> against file tampering. Every entry would essentially have its key as well
>> as additional bytes to confirm that the contents were as expected.
>>
>> On Thu, May 7, 2026 at 1:08 PM Kurtis Wright <[email protected]>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> Kurtis from S3Tables. S3 utilizes checksums
>>> <https://en.wikipedia.org/wiki/Checksum> for durability and correctness
>>> <https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html>.
>>> I see that S3 & GCS clients utilize checksumming, but after searching
>>> through both the Java implementation and the mailing list (just going
>>> back a few months) I couldn't find any reference to something in the
>>> spec. I started writing a proposal for adding checksums for durability
>>> and correctness at a few different layers of Iceberg, but before I
>>> complete a proposal I wanted to check with the community to gauge
>>> interest in the concepts and hopefully get some initial feedback.
>>>
>>> The layers of Iceberg I am considering are:
>>>
>>>    1. At rest/storage in the file layer (metadata.json, manifest layer,
>>>    data file layer)
>>>    2. Bracketing in the Catalog
>>>    3. Maybe during compaction operations (unsure exactly how this would
>>>    work)
>>>
>>> Please let me know if there were considerations that were rejected or
>>> grew stale in the past. I would really appreciate reading more on what
>>> the community has considered already and learning from that. Otherwise,
>>> if you think this is cool and want to talk, or just want to plus-one,
>>> please reach out.
>>>
>>> --
>>> Thank You,
>>> Kurtis C. Wright
>>>
>>
>
> --
> Thank You and Cheers,
> Kurtis C. Wright
>
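Steve's point about CRC32C being cheap to validate can be sketched in a few
lines of Java. This is an illustrative example only, not Iceberg code; the
class and method names below are hypothetical. `java.util.zip.CRC32C`
(available since JDK 9) is JIT-intrinsified to the CPU's CRC32 instructions
where the hardware supports them, which is what makes a whole-file audit
read inexpensive:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Minimal sketch: validate bytes read back from storage against a CRC32C
// recorded in metadata at write time, as a background audit process might.
public class Crc32cAudit {

    // Compute the CRC32C of a byte array.
    static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True if the data still matches the checksum recorded at write time.
    static boolean verify(byte[] data, long expected) {
        return crc32c(data) == expected;
    }

    public static void main(String[] args) {
        byte[] original = "manifest contents".getBytes(StandardCharsets.UTF_8);
        long stored = crc32c(original); // would be recorded in metadata

        // A clean read-back verifies; a single flipped bit does not
        // (CRC detects all single-bit errors).
        byte[] corrupted = original.clone();
        corrupted[0] ^= 0x01;
        System.out.println(verify(original, stored));  // true
        System.out.println(verify(corrupted, stored)); // false
    }
}
```

Note this only covers the accidental-overwrite/bit-rot case Steve describes;
a plain CRC is trivially forgeable, so the deliberate-tampering case Russell
mentions would need keyed authentication bytes rather than a checksum.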
