Hey Celeborn community! We (Stripe) are exploring ways to reduce the disk spills caused by external sorting on the reducer side, as a partial solution to https://issues.apache.org/jira/browse/CELEBORN-1435.
Our current idea is to sort each ReducePartition file on the Celeborn worker at write time and then use a K-way merge on the reducers to finish sorting the data. Some data would still have to be spilled back to Celeborn when the number of ReducePartition files is high and the external merge sort needs multiple passes. This description is pretty generic, but I'm curious to hear opinions before starting to draft the CIP! It would also be nice to know whether there have been previous thoughts on, or interest in, reducing disk usage on executors beyond the Jira ticket above.
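
To make the idea a bit more concrete, here is a rough Python sketch of the reducer-side merge. It is only an illustration, not Celeborn code: the `(key, value)` record shape, the function names, and the `max_fan_in` limit are all assumptions made for the example, and the in-memory lists stand in for sorted ReducePartition files and for intermediate runs that would be spilled back to Celeborn.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (key, value). Real shuffle records would differ.
Record = Tuple[int, str]


def k_way_merge(sorted_runs: List[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge runs that are each already sorted by key.

    Only one pending record per run is held at a time, which is why
    pre-sorted files avoid a full external sort on the reducer.
    """
    return heapq.merge(*sorted_runs, key=lambda rec: rec[0])


def merge_with_fan_in(
    sorted_runs: List[Iterable[Record]], max_fan_in: int
) -> Tuple[List[Record], int]:
    """Merge at most `max_fan_in` runs at once (an assumed resource limit).

    Returns the fully merged records and the number of intermediate passes.
    Each intermediate pass models data that would be spilled back.
    """
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    runs = [list(run) for run in sorted_runs]
    passes = 0
    while len(runs) > max_fan_in:
        # Merge groups of up to max_fan_in runs into fewer, longer runs.
        runs = [
            list(k_way_merge(runs[i:i + max_fan_in]))
            for i in range(0, len(runs), max_fan_in)
        ]
        passes += 1
    return list(k_way_merge(runs)), passes
```

With three sorted runs and `max_fan_in=2`, the sketch needs one intermediate pass before the final merge; with `max_fan_in=3` it needs none, which is the case where nothing would have to be spilled back.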
