Re: [DISCUSS] At-rest encryption for shuffle data on workers and tiered storage

Aravind Patnam Mon, 20 Apr 2026 13:18:02 -0700

Let me start a thread for a CIP later this week.

Aravind K. Patnam



On Mon, Apr 20, 2026 at 8:37 AM Mridul Muralidharan <[email protected]>
wrote:

> Hi,
>
>   This is pretty much what Aravind has implemented internally :-)
>
> Regards,
> Mridul
>
> On Mon, Apr 20, 2026 at 8:34 AM rexxiong <[email protected]> wrote:
>
> > Hi Karthik,
> >
> > Thanks for the thorough proposal. I'd like to suggest an alternative
> > approach: client-side encryption, where encryption/decryption happens
> > entirely in the Spark client, similar to how compression works today.
> > Celeborn workers and master would only ever see ciphertext.
> >
> > Rationale:
> >
> > 1. Simpler architecture: no worker write/read path changes. Celeborn
> > remains a stateless byte pipe, just like it is for compression.
> > 2. Stronger security: plaintext and encryption keys never reach workers
> > or master. The trust boundary stays within the client, eliminating the
> > need for KMS credentials on the server side.
> > 3. No sendfile regression: since workers store ciphertext natively,
> > CELEBORN-2301's zero-copy sendfile works unchanged for all workloads,
> > encrypted or not.
> > 4. Aligns with Spark: this naturally respects spark.io.encryption.enabled,
> > and we can reuse Spark's existing key distribution via IOEncryptionKey.
> >
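A minimal sketch of the client-side idea described above, for illustration only: the client encrypts each block before pushing it and decrypts after fetching, so a worker would only ever store the ciphertext. The class and method names are hypothetical, and the choice of AES-GCM with a per-block random IV is an assumption of this sketch, not something the thread specifies. In the proposal the key would come from Spark's existing key distribution rather than being generated locally as it is here.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical helper, not part of Celeborn or Spark.
final class ClientSideShuffleCipher {
    static final int IV_LEN = 12;     // 96-bit IV, the usual size for GCM
    static final int TAG_BITS = 128;  // 16-byte authentication tag
    private static final SecureRandom RNG = new SecureRandom();

    // Stand-in for a key that would be distributed by the engine.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    // Output layout (an assumption of this sketch): IV || ciphertext || tag.
    static byte[] encryptBlock(SecretKey key, byte[] plaintext)
            throws GeneralSecurityException {
        byte[] iv = new byte[IV_LEN];
        RNG.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = c.doFinal(plaintext);
        return ByteBuffer.allocate(IV_LEN + ct.length).put(iv).put(ct).array();
    }

    // Reads the IV from the front of the block, then decrypts and verifies the tag.
    static byte[] decryptBlock(SecretKey key, byte[] block)
            throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(TAG_BITS, block, 0, IV_LEN));
        return c.doFinal(block, IV_LEN, block.length - IV_LEN);
    }
}
```

With this layout each block grows by 28 bytes (12 for the IV, 16 for the tag), and the bytes handed to the worker can be served as-is, which is the property point 3 relies on.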
> > Happy to discuss further.
> >
> > Regards,
> > Jiashu Xiong
> >
> > Aravind Patnam <[email protected]> wrote on Mon, Apr 20, 2026 at 05:08:
> >
> > > Hi,
> > >
> > > We already have encryption at rest (EAR) for Celeborn shuffle data
> > > internally at LinkedIn, where we have added this support to respect the
> > > existing spark.io.encryption.enabled config in Spark on the client side.
> > >
> > > I am happy to contribute this back and start a CIP for this next week.
> > >
> > > Thanks,
> > > Aravind
> > >
> > >
> > >
> > > On Sat, Apr 18, 2026 at 10:24 PM Karthik Prabhakar
> > > <[email protected]> wrote:
> > >
> > > > Hi dev@,
> > > >
> > > > I’d like to propose adding at-rest encryption for shuffle data in
> > > > Celeborn and would appreciate the community’s input before writing a
> > > > full implementation.
> > > > Current Gap
> > > >
> > > > Celeborn encrypts data in transit (TLS, SASL) but not at rest. When a
> > > > worker flushes shuffle data to local disk, HDFS, S3, or OSS, the bytes
> > > > land as plaintext.
> > > >
> > > > The only write site for local disk is LocalFlushTask.flush() in
> > > > FlushTask.scala (L66, L71 at commit a56f69a), which calls
> > > > fileChannel.write(buffer) with no cipher transform. The tiered-storage
> > > > paths (HdfsFlushTask, S3FlushTask, OssFlushTask) are the same: raw
> > > > bytes to the underlying store.
> > > >
> > > > Verified with:
> > > >
> > > > grep -rnE 'cipher|\.encrypt|aes|envelope' worker/src/main/
> > > > grep -rn  'javax\.crypto'                 worker/src/main/
> > > > (both zero matches)
> > > >
> > > > This matters because spark.io.encryption.enabled does *not* cover the
> > > > Celeborn path. When Celeborn’s ShuffleManager replaces Spark’s shuffle
> > > > writer, Spark’s encryption key is never consulted, as confirmed by
> > > > grepping client-spark/ for IOEncryptionKey (zero matches).
> > > >
> > > > Teams adopting Celeborn for performance silently lose the
> > > > shuffle-encryption guarantees their compliance posture may assume.
> > > > Who Needs This
> > > >
> > > >    - Regulated industries (healthcare, finance, public sector) whose
> > > >    auditors require application-layer encryption independent of
> > > >    disk/volume encryption.
> > > >    - Multi-tenant platforms needing cryptographic isolation between
> > > >    tenants on shared workers.
> > > >    - Teams using object-store tiering who want encryption before
> > > >    offload.
> > > >
> > > > Proposed Approach (High Level)
> > > >
> > > >    1. A *StreamCipher SPI* in common/ for wrapping WritableByteChannel /
> > > >    ReadableByteChannel with encrypt/decrypt. No KMS SDK in core.
> > > >    2. A *KeyService SPI* for envelope encryption: generate/unwrap DEKs
> > > >    using a KMS-held KEK. Implementations live in separate optional
> > > >    modules (aws-kms, gcp-kms, azure-kv, vault, static for dev/PoC).
> > > >    3. Wire into the worker write path: LocalFlushTask wraps fileChannel
> > > >    with StreamCipher.wrapForWrite(). Same for HDFS/S3/OSS flush tasks.
> > > >    4. Wire into the reader path: LocalPartitionDataReader detects a
> > > >    16-byte encrypted-file header, unwraps the DEK (cached per
> > > >    worker+shuffle), and wraps the channel with
> > > >    StreamCipher.wrapForRead().
> > > >    5. Opt-in via celeborn.shuffle.io.encryption.enabled=true. Default
> > > >    off. Unencrypted deployments are byte-identical to today, zero
> > > >    overhead.
> > > >    6. Per-shuffle DEKs by default (one KMS call per shuffle
> > > >    reservation, amortized). Per-application DEK scope as an option.
> > > >
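The channel-wrapping steps (1, 3 and 4) could be sketched roughly as follows. Only the names StreamCipher, wrapForWrite and wrapForRead come from the proposal; the method signatures, the AesCtrStreamCipher class, the use of AES-CTR, and the idea that the 16-byte header is simply the random IV are all assumptions made for this illustration. The proposal does not say what the header contains or how it is detected.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Assumed SPI shape; the thread names the SPI but not its signatures.
interface StreamCipher {
    WritableByteChannel wrapForWrite(WritableByteChannel out)
            throws IOException, GeneralSecurityException;
    ReadableByteChannel wrapForRead(ReadableByteChannel in)
            throws IOException, GeneralSecurityException;
}

// Hypothetical implementation: AES-CTR keyed by a DEK, IV stored as the header.
final class AesCtrStreamCipher implements StreamCipher {
    static final int HEADER_LEN = 16; // assumption: header == 16-byte IV
    private final SecretKey dek;

    AesCtrStreamCipher(byte[] rawDek) {
        this.dek = new SecretKeySpec(rawDek, "AES");
    }

    @Override
    public WritableByteChannel wrapForWrite(WritableByteChannel out)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[HEADER_LEN];
        new SecureRandom().nextBytes(iv);
        // Write the header in the clear, then encrypt everything after it.
        ByteBuffer header = ByteBuffer.wrap(iv);
        while (header.hasRemaining()) {
            out.write(header);
        }
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, dek, new IvParameterSpec(iv));
        return Channels.newChannel(
                new CipherOutputStream(Channels.newOutputStream(out), c));
    }

    @Override
    public ReadableByteChannel wrapForRead(ReadableByteChannel in)
            throws IOException, GeneralSecurityException {
        // Consume the header first; the rest of the channel is ciphertext.
        ByteBuffer header = ByteBuffer.allocate(HEADER_LEN);
        while (header.hasRemaining()) {
            if (in.read(header) < 0) {
                throw new IOException("truncated encryption header");
            }
        }
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, dek, new IvParameterSpec(header.array()));
        return Channels.newChannel(
                new CipherInputStream(Channels.newInputStream(in), c));
    }
}
```

AES-CTR keeps the ciphertext the same length as the plaintext, so the only size change in this sketch is the header. It provides no integrity protection, so a real design would need to decide separately whether authenticated encryption is required.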
> > > > Interaction with Recent Work
> > > >
> > > > CELEBORN-2301 (commit 95419e1) recently landed enhanced zero-copy
> > > > sendfile for FileRegion on native transports, a nice throughput win
> > > > for the fetch path.
> > > >
> > > > Encryption and sendfile are fundamentally incompatible: sendfile(2)
> > > > cannot transform bytes, so encrypted partitions must use a buffered
> > > > read path. This is only relevant for encrypted workloads; unencrypted
> > > > workloads on the same cluster keep the full CELEBORN-2301 benefit.
> > > > Per-application encryption flags (not per-cluster) would let encrypted
> > > > and unencrypted apps coexist without regressing the latter.
> > > > Questions for the Community
> > > >
> > > > Trimming to three since these are the ones I’d need opinions on before
> > > > writing code. Happy to take the rest up in follow-ups.
> > > >
> > > >    - Any prior design work or internal discussion on this topic I
> > > >    should know about before proceeding?
> > > >    - *Per-shuffle vs. per-application DEK scope* as the default?
> > > >    Per-shuffle gives smaller blast radius and simpler lifecycle;
> > > >    per-application amortizes KMS round-trips and is friendlier for
> > > >    long-running jobs.
> > > >    - *Key distribution path:* wrapped DEKs flow through Master metadata
> > > >    (simpler, one KMS-aware role) vs. workers unwrap directly from KMS
> > > >    (removes Master from the key path, but every worker needs KMS
> > > >    credentials). Preference?
> > > >
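For readers less familiar with envelope encryption, the DEK/KEK relationship behind these questions can be sketched as below. This is an illustration under stated assumptions: the KeyService method names, the WrappedDek holder, and the StaticKeyService class are hypothetical, and an in-memory AES key stands in for the KMS-held KEK (roughly what a "static for dev/PoC" implementation might look like). Wrapping uses the JDK's AESWrap cipher; a real KMS integration would call the provider's own wrap/unwrap API instead.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;

// Assumed SPI shape; the thread names a KeyService SPI but not its methods.
interface KeyService {
    WrappedDek generateDek() throws GeneralSecurityException;
    SecretKey unwrapDek(byte[] wrapped) throws GeneralSecurityException;
}

// Hypothetical holder: the plaintext DEK for immediate use, plus the wrapped
// form that is safe to persist or pass through metadata.
final class WrappedDek {
    final SecretKey plaintext;
    final byte[] wrapped;

    WrappedDek(SecretKey plaintext, byte[] wrapped) {
        this.plaintext = plaintext;
        this.wrapped = wrapped;
    }
}

// Dev/PoC-style implementation: the KEK lives in process memory, not in a KMS.
final class StaticKeyService implements KeyService {
    private final SecretKey kek;

    StaticKeyService(byte[] rawKek) {
        this.kek = new SecretKeySpec(rawKek, "AES");
    }

    @Override
    public WrappedDek generateDek() throws GeneralSecurityException {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey dek = kg.generateKey();
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.WRAP_MODE, kek);
        return new WrappedDek(dek, c.wrap(dek));
    }

    @Override
    public SecretKey unwrapDek(byte[] wrapped) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.UNWRAP_MODE, kek);
        return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    }
}
```

In terms of the two distribution options, only the wrapped bytes would travel through Master metadata, and whichever role calls unwrapDek is the one that needs access to the KEK. The DEK scope question then comes down to how often generateDek is called: once per shuffle or once per application.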
> > > > Tracking
> > > >
> > > > JIRA: CELEBORN-2311
> > > > <https://issues.apache.org/jira/browse/CELEBORN-2311>
> > > >
> > > > I have a detailed design document with source citations, threat model,
> > > > performance analysis, and phased implementation plan. Happy to share
> > > > on-list or off-list if there’s interest.
> > > >
> > > > - Karthik
> > > >
> > >
> > >
> > > --
> > > Aravind K. Patnam
> > >
> >
>
