Re: [DISCUSS] CIP - Support Remote Block Spilling for Compute Engines

Keyong Zhou Sat, 21 Mar 2026 23:02:30 -0700

Hi Liam,

Thanks for the proposal! Celeborn has targeted both shuffle and spilled
data (together, intermediate data) since day one, as the "About" text on
GitHub says: Apache Celeborn is an elastic and high-performance service for
shuffle and spilled data.


Really great to see the community raising this.

Regards,
Keyong Zhou

On Sun, Mar 15, 2026 at 17:02, Liam Hecht <[email protected]> wrote:

> Hi Celeborn Devs,
>
> I would like to start a discussion about a new idea for Celeborn: *Support
> Remote Block Spilling for Compute Engines*.
>
> Celeborn works very well for shuffle data, but compute engines like Spark
> still use local disks for execution spills (for example during large sorts
> or aggregations). This can be a problem on machines with limited local
> storage.
>
> The idea is to extend Celeborn so it can store these spill blocks remotely.
> When an executor runs out of memory, instead of writing to local disk, it
> would send the spilled data to Celeborn Workers.
>
> *Main points of the proposal:*
>
>    - New RPC messages: PushSpillData and ReleaseSpill
>    - Worker-side SpillFileManager to manage spilled blocks
>    - Single-copy storage since the data is temporary
>    - Reuse Celeborn’s existing network and storage components
>
> I think this could make Celeborn more useful for cloud-native and diskless
> environments.
>
> Happy to share a more detailed design document if there is interest.
>
> Best regards,
> Liam Hecht
>
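
To make the main points of the proposal concrete, here is a minimal,
language-neutral sketch of how the two messages and the worker-side manager
might fit together. Only the names PushSpillData, ReleaseSpill and
SpillFileManager come from the proposal. The fields (app_id, spill_id,
payload), the method names, the read path and the in-memory dict are all
assumptions made for illustration, not Celeborn's actual API or design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PushSpillData:
    """Hypothetical request: an executor pushes part of a spill to a worker."""
    app_id: str
    spill_id: int
    payload: bytes


@dataclass(frozen=True)
class ReleaseSpill:
    """Hypothetical request: the executor no longer needs this spill."""
    app_id: str
    spill_id: int


@dataclass
class SpillFileManager:
    """Toy worker-side manager holding a single copy of each spill in memory."""
    blocks: dict = field(default_factory=dict)

    def handle_push(self, msg: PushSpillData) -> int:
        key = (msg.app_id, msg.spill_id)
        # Append, so one spill may arrive as several pushes (an assumption).
        self.blocks[key] = self.blocks.get(key, b"") + msg.payload
        return len(self.blocks[key])

    def read(self, app_id: str, spill_id: int) -> bytes:
        # Assumed read path; the proposal does not name a fetch message.
        return self.blocks[(app_id, spill_id)]

    def handle_release(self, msg: ReleaseSpill) -> bool:
        # True if a stored spill was actually freed, False if it was unknown.
        return self.blocks.pop((msg.app_id, msg.spill_id), None) is not None
```

Because storage is single-copy, a lost worker would mean a lost spill in this
sketch, so the engine would have to recompute the affected task. How that
case should be handled is left to the detailed design document.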
