Hi Karthik,

Thanks for the thorough proposal. I'd like to suggest an alternative approach: client-side encryption, where encryption and decryption happen entirely in the Spark client, similar to how compression works today. Celeborn workers and master would only ever see ciphertext.
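To make the idea concrete, here is a minimal sketch of the client-side transform, assuming it sits where compression sits today: encrypt each block before it is pushed, decrypt after it is fetched. Everything in it is hypothetical and for illustration only. The class and method names are not Celeborn or Spark APIs, the IV-prefixed framing is just one possible layout, and the key is generated locally, whereas a real integration would obtain it from Spark's existing key distribution. AES/CTR/NoPadding is used because it keeps the ciphertext the same length as the plaintext; it provides no integrity protection, so an authenticated mode may be preferable.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, not part of Celeborn or Spark.
final class ClientSideShuffleCrypto {
    private static final String TRANSFORMATION = "AES/CTR/NoPadding";
    private static final int IV_LENGTH = 16; // AES block size in bytes
    private static final SecureRandom RANDOM = new SecureRandom();

    // Encrypts one block on the client. Output layout (an assumption of this
    // sketch): 16-byte random IV followed by the ciphertext.
    static byte[] encrypt(SecretKey key, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_LENGTH];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] body = cipher.doFinal(plaintext);
            byte[] framed = Arrays.copyOf(iv, IV_LENGTH + body.length);
            System.arraycopy(body, 0, framed, IV_LENGTH, body.length);
            return framed;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    // Decrypts a block produced by encrypt(). Only the client holds the key;
    // the server side would store and serve the framed bytes untouched.
    static byte[] decrypt(SecretKey key, byte[] framed) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new IvParameterSpec(framed, 0, IV_LENGTH));
            return cipher.doFinal(framed, IV_LENGTH, framed.length - IV_LENGTH);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a key distributed to executors by the driver.
        byte[] rawKey = new byte[16];
        RANDOM.nextBytes(rawKey);
        SecretKey key = new SecretKeySpec(rawKey, "AES");

        byte[] block = "shuffle block payload".getBytes(StandardCharsets.UTF_8);
        byte[] stored = encrypt(key, block);      // what a worker would see
        byte[] restored = decrypt(key, stored);   // what the reader gets back
        System.out.println("round trip ok: " + Arrays.equals(block, restored));
    }
}
```

In this sketch the worker never needs the key or a cipher: it receives, stores, and serves the framed bytes exactly as it does compressed bytes.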
Rationale:

1. Simpler architecture: no worker write/read path changes. Celeborn remains a stateless byte pipe, just as it is for compression.
2. Stronger security: plaintext and encryption keys never reach workers or master. The trust boundary stays within the client, eliminating the need for KMS credentials on the server side.
3. No sendfile regression: since workers store ciphertext as-is, CELEBORN-2301's zero-copy sendfile works unchanged for all workloads, encrypted or not.
4. Aligns with Spark: this naturally respects spark.io.encryption.enabled, and we can reuse Spark's existing key distribution via IOEncryptionKey.

Happy to discuss further.

Regards,
Jiashu Xiong

On Mon, Apr 20, 2026 at 05:08, Aravind Patnam <[email protected]> wrote:

> Hi,
>
> We already have EAR for celeborn shuffle data internally at LinkedIn, where
> we have added this support to respect the existing
> spark.io.encryption.enabled config in Spark on the client side.
>
> I am happy to contribute this back and start a CIP for this next week.
>
> Thanks,
> Aravind
>
> On Sat, Apr 18, 2026 at 10:24 PM Karthik Prabhakar <[email protected]>
> wrote:
>
> > Hi dev@,
> >
> > I’d like to propose adding at-rest encryption for shuffle data in
> > Celeborn and would appreciate the community’s input before writing a
> > full implementation.
> >
> > Current Gap
> >
> > Celeborn encrypts data in transit (TLS, SASL) but not at rest. When a
> > worker flushes shuffle data to local disk, HDFS, S3, or OSS, the bytes
> > land as plaintext.
> >
> > The only write site for local disk is LocalFlushTask.flush() in
> > FlushTask.scala (L66, L71 at commit a56f69a), which calls
> > fileChannel.write(buffer) with no cipher transform. The tiered-storage
> > paths (HdfsFlushTask, S3FlushTask, OssFlushTask) are the same: raw bytes
> > to the underlying store.
> >
> > Verified with:
> >
> >   grep -rnE 'cipher|\.encrypt|aes|envelope' worker/src/main/
> >   grep -rn 'javax\.crypto' worker/src/main/
> >   (both zero matches)
> >
> > This matters because spark.io.encryption.enabled does *not* cover the
> > Celeborn path. When Celeborn’s ShuffleManager replaces Spark’s shuffle
> > writer, Spark’s encryption key is never consulted, as confirmed by
> > grepping client-spark/ for IOEncryptionKey (zero matches).
> >
> > Teams adopting Celeborn for performance silently lose shuffle-encryption
> > guarantees their compliance posture may assume.
> >
> > Who Needs This
> >
> > - Regulated industries (healthcare, finance, public sector) whose
> >   auditors require application-layer encryption independent of
> >   disk/volume encryption.
> > - Multi-tenant platforms needing cryptographic isolation between tenants
> >   on shared workers.
> > - Teams using object-store tiering who want encryption before offload.
> >
> > Proposed Approach (High Level)
> >
> > 1. A *StreamCipher SPI* in common/ for wrapping WritableByteChannel /
> >    ReadableByteChannel with encrypt/decrypt. No KMS SDK in core.
> > 2. A *KeyService SPI* for envelope encryption: generate/unwrap DEKs
> >    using a KMS-held KEK. Implementations live in separate optional
> >    modules (aws-kms, gcp-kms, azure-kv, vault, static for dev/PoC).
> > 3. Wire into the worker write path: LocalFlushTask wraps fileChannel
> >    with StreamCipher.wrapForWrite(). Same for HDFS/S3/OSS flush tasks.
> > 4. Wire into the reader path: LocalPartitionDataReader detects a
> >    16-byte encrypted-file header, unwraps the DEK (cached per
> >    worker+shuffle), wraps the channel with StreamCipher.wrapForRead().
> > 5. Opt-in via celeborn.shuffle.io.encryption.enabled=true. Default off.
> >    Unencrypted deployments are byte-identical to today, zero overhead.
> > 6. Per-shuffle DEKs by default (one KMS call per shuffle reservation,
> >    amortized). Per-application DEK scope as an option.
> >
> > Interaction with Recent Work
> >
> > CELEBORN-2301 (commit 95419e1) recently landed enhanced zero-copy
> > sendfile for FileRegion on native transports, a nice throughput win for
> > the fetch path.
> >
> > Encryption and sendfile are fundamentally incompatible: sendfile(2)
> > cannot transform bytes, so encrypted partitions must use a buffered read
> > path. This is only relevant for encrypted workloads; unencrypted
> > workloads on the same cluster keep the full CELEBORN-2301 benefit.
> > Per-application encryption flags (not per-cluster) would let encrypted
> > and unencrypted apps coexist without regressing the latter.
> >
> > Questions for the Community
> >
> > Trimming to three since these are the ones I’d need opinions on before
> > writing code. Happy to take the rest up in follow-ups.
> >
> > - Any prior design work or internal discussion on this topic I should
> >   know about before proceeding?
> > - *Per-shuffle vs. per-application DEK scope* as the default?
> >   Per-shuffle gives smaller blast radius and simpler lifecycle;
> >   per-application amortizes KMS round-trips and is friendlier for
> >   long-running jobs.
> > - *Key distribution path:* wrapped DEKs flow through Master metadata
> >   (simpler, one KMS-aware role) vs. workers unwrap directly from KMS
> >   (removes Master from the key path, but every worker needs KMS
> >   credentials). Preference?
> >
> > Tracking
> >
> > JIRA: CELEBORN-2311 <https://issues.apache.org/jira/browse/CELEBORN-2311>
> >
> > I have a detailed design document with source citations, threat model,
> > performance analysis, and phased implementation plan. Happy to share
> > on-list or off-list if there’s interest.
> >
> > - Karthik
>
> --
> Aravind K. Patnam
