Re: [DISCUSS] At-rest encryption for shuffle data on workers and tiered storage

rexxiong Mon, 20 Apr 2026 06:35:01 -0700

Hi Karthik,

Thanks for the thorough proposal. I'd like to suggest an alternative
approach: client-side encryption, where encryption/decryption happens
entirely in the Spark client, similar to how compression works today.
Celeborn workers and master would only ever see ciphertext.


Rationale:

1. Simpler architecture: no worker write/read path changes. Celeborn
remains a stateless byte pipe, just like it is for compression.
2. Stronger security: plaintext and encryption keys never reach workers or
master. The trust boundary stays within the client, eliminating the need
for KMS credentials on the server side.
3. No sendfile regression: since workers store ciphertext natively,
CELEBORN-2301's zero-copy sendfile works unchanged for all workloads,
encrypted or not.
4. Aligns with Spark: this naturally respects spark.io.encryption.enabled,
and we can reuse Spark's existing key distribution via IOEncryptionKey.
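
To make the idea concrete, here is a minimal sketch of the client-side
transform, not a proposed implementation. It assumes AES-GCM from the JDK
(Spark's own I/O encryption may well use a different mode), a key that has
already reached the executor, and a hypothetical ClientSideShuffleCipher
class that exists in neither Celeborn nor Spark today. Each block would be
encrypted before it is pushed (presumably after compression, since
ciphertext does not compress), so workers only ever store
iv || ciphertext || tag.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration; not an existing Celeborn or Spark class.
final class ClientSideShuffleCipher {
    private static final int IV_LEN = 12;    // 96-bit nonce, the usual size for GCM
    private static final int TAG_BITS = 128; // authentication tag length in bits
    private static final SecureRandom RANDOM = new SecureRandom();

    private final SecretKeySpec key;

    // keyBytes must be a valid AES key (16, 24, or 32 bytes); how it is
    // distributed to executors is out of scope for this sketch.
    ClientSideShuffleCipher(byte[] keyBytes) {
        this.key = new SecretKeySpec(keyBytes, "AES");
    }

    // Encrypts one shuffle block on the client; returns iv || ciphertext || tag.
    byte[] encrypt(byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_LEN];
        RANDOM.nextBytes(iv); // fresh random nonce per block
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] body = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(IV_LEN + body.length).put(iv).put(body).array();
    }

    // Decrypts a block fetched from a worker; throws if the tag does not verify.
    byte[] decrypt(byte[] block) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
            new GCMParameterSpec(TAG_BITS, block, 0, IV_LEN));
        return cipher.doFinal(block, IV_LEN, block.length - IV_LEN);
    }
}
```

The per-block overhead in this sketch is 28 bytes (12-byte nonce plus
16-byte tag). Workers would treat the result as opaque bytes, which is why
the sendfile path would not need to change.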

Happy to discuss further.

Regards,
Jiashu Xiong

Aravind Patnam <[email protected]> wrote on Mon, Apr 20, 2026 at 05:08:

> Hi,
>
> We already have encryption at rest (EAR) for Celeborn shuffle data
> internally at LinkedIn, where we added support for respecting the
> existing spark.io.encryption.enabled config in Spark on the client side.
>
> I am happy to contribute this back and start a CIP for this next week.
>
> Thanks,
> Aravind
>
>
>
> On Sat, Apr 18, 2026 at 10:24 PM Karthik Prabhakar <[email protected]>
> wrote:
>
> > Hi dev@,
> >
> > I’d like to propose adding at-rest encryption for shuffle data in
> > Celeborn and would appreciate the community’s input before writing a
> > full implementation.
> >
> > Current Gap
> >
> > Celeborn encrypts data in transit (TLS, SASL) but not at rest. When a
> > worker flushes shuffle data to local disk, HDFS, S3, or OSS, the bytes
> > land as plaintext.
> >
> > The only write site for local disk is LocalFlushTask.flush() in
> > FlushTask.scala (L66, L71 at commit a56f69a), which calls
> > fileChannel.write(buffer) with no cipher transform. The tiered-storage
> > paths (HdfsFlushTask, S3FlushTask, OssFlushTask) are the same — raw bytes
> > to the underlying store.
> >
> > Verified with:
> >
> > grep -rnE 'cipher|\.encrypt|aes|envelope' worker/src/main/
> > grep -rn  'javax\.crypto'                 worker/src/main/
> > (both zero matches)
> >
> > This matters because spark.io.encryption.enabled does *not* cover the
> > Celeborn path. When Celeborn’s ShuffleManager replaces Spark’s shuffle
> > writer, Spark’s encryption key is never consulted — confirmed by grepping
> > client-spark/ for IOEncryptionKey (zero matches).
> >
> > Teams adopting Celeborn for performance silently lose shuffle-encryption
> > guarantees their compliance posture may assume.
> >
> > Who Needs This
> >
> >    - Regulated industries (healthcare, finance, public sector) whose
> >    auditors require application-layer encryption independent of
> >    disk/volume encryption.
> >    - Multi-tenant platforms needing cryptographic isolation between
> >    tenants on shared workers.
> >    - Teams using object-store tiering who want encryption before offload.
> >
> > Proposed Approach (High Level)
> >
> >    1. A *StreamCipher SPI* in common/ for wrapping WritableByteChannel /
> >    ReadableByteChannel with encrypt/decrypt. No KMS SDK in core.
> >    2. A *KeyService SPI* for envelope encryption: generate/unwrap DEKs
> >    using a KMS-held KEK. Implementations live in separate optional
> >    modules (aws-kms, gcp-kms, azure-kv, vault, static for dev/PoC).
> >    3. Wire into the worker write path: LocalFlushTask wraps fileChannel
> >    with StreamCipher.wrapForWrite(). Same for HDFS/S3/OSS flush tasks.
> >    4. Wire into the reader path: LocalPartitionDataReader detects a
> >    16-byte encrypted-file header, unwraps the DEK (cached per
> >    worker+shuffle), and wraps the channel with
> >    StreamCipher.wrapForRead().
> >    5. Opt-in via celeborn.shuffle.io.encryption.enabled=true. Default
> >    off. Unencrypted deployments are byte-identical to today, zero
> >    overhead.
> >    6. Per-shuffle DEKs by default (one KMS call per shuffle reservation,
> >    amortized). Per-application DEK scope as an option.
> >
> > Interaction with Recent Work
> >
> > CELEBORN-2301 (commit 95419e1) recently landed enhanced zero-copy
> > sendfile for FileRegion on native transports, a nice throughput win for
> > the fetch path.
> >
> > Encryption and sendfile are fundamentally incompatible: sendfile(2)
> > cannot transform bytes, so encrypted partitions must use a buffered read
> > path. This is only relevant for encrypted workloads; unencrypted
> > workloads on the same cluster keep the full CELEBORN-2301 benefit.
> > Per-application encryption flags (not per-cluster) would let encrypted
> > and unencrypted apps coexist without regressing the latter.
> >
> > Questions for the Community
> >
> > Trimming to three since these are the ones I’d need opinions on before
> > writing code. Happy to take the rest up in follow-ups.
> >
> >    - Any prior design work or internal discussion on this topic I should
> >    know about before proceeding?
> >    - *Per-shuffle vs. per-application DEK scope* as the default?
> >    Per-shuffle gives smaller blast radius and simpler lifecycle;
> >    per-application amortizes KMS round-trips and is friendlier for
> >    long-running jobs.
> >    - *Key distribution path:* wrapped DEKs flow through Master metadata
> >    (simpler, one KMS-aware role) vs. workers unwrap directly from KMS
> >    (removes Master from the key path, but every worker needs KMS
> >    credentials). Preference?
> >
> > Tracking
> >
> > JIRA: CELEBORN-2311 <https://issues.apache.org/jira/browse/CELEBORN-2311>
> >
> > I have a detailed design document with source citations, threat model,
> > performance analysis, and phased implementation plan. Happy to share
> > on-list or off-list if there’s interest.
> >
> > - Karthik
> >
>
>
> --
> Aravind K. Patnam
>
