Yeah, retaining the map output can reduce the number of tasks that need to be
recomputed for DETERMINATE stages when an output file is lost.
This is one important design tradeoff.
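To make that tradeoff concrete, here is a minimal sketch (made-up names, not
actual Celeborn or Spark code): with map output retained, a DETERMINATE stage
only reruns the map tasks whose output was lost, while an INDETERMINATE stage
has to rerun everything.

```scala
sealed trait StageDeterminism
case object Determinate extends StageDeterminism
case object Indeterminate extends StageDeterminism

// Which map tasks must rerun after some map outputs are lost.
def tasksToRecompute(
    determinism: StageDeterminism,
    numMapTasks: Int,
    lostMapOutputs: Set[Int]): Set[Int] = determinism match {
  // DETERMINATE: a rerun reproduces identical output, so only the
  // tasks whose output was lost need to run again.
  case Determinate => lostMapOutputs
  // INDETERMINATE: a rerun may produce different output, so every
  // map task (and its downstream consumers) must run again.
  case Indeterminate => (0 until numMapTasks).toSet
}
```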
Currently Celeborn also supports MapPartition for Flink Batch, in
which case partition data is not aggregated; instead, each partition
stores the data of a single map task.
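A toy illustration of the two layouts (the types and helpers here are
invented for clarity, not Celeborn's API): the aggregated layout groups
blocks by reduce partition, while MapPartition groups them by map task.

```scala
case class Block(mapTaskId: Int, reducePartitionId: Int, bytes: Array[Byte])

// Aggregated (ReducePartition) layout: blocks for the same reduce
// partition from all map tasks end up together, so a reducer reads
// one location sequentially.
def reducePartitionLayout(blocks: Seq[Block]): Map[Int, Seq[Block]] =
  blocks.groupBy(_.reducePartitionId)

// MapPartition layout (Flink batch): blocks are grouped per map task,
// so no aggregation happens across mappers.
def mapPartitionLayout(blocks: Seq[Block]): Map[Int, Seq[Block]] =
  blocks.groupBy(_.mapTaskId)
```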
With push-based shuffle in Apache Spark (magnet), we have both the map
output and the reducer-oriented merged output preserved, with the
reducer-oriented view chosen by default for reads and a fallback to the
mapper output when the reducer output is missing or reads fail. That
mitigates this specific issue for DETERMINATE stages.
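For reference, the client-side switches for push-based shuffle in Spark 3.2+
look like this (the server side additionally needs the merged shuffle file
manager enabled in the YARN shuffle service; treat the exact prerequisites
as version dependent):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // External shuffle service is a prerequisite for push-based shuffle.
  .set("spark.shuffle.service.enabled", "true")
  // Push map output to remote shuffle services to be merged per reducer.
  .set("spark.shuffle.push.enabled", "true")

// Server side (set in the YARN shuffle service config, not SparkConf):
//   spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//     org.apache.spark.network.shuffle.RemoteBlockPushResolver
```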
Hi Sungwoo,
What you are pointing out is correct. Currently shuffle data
is distributed across up to `celeborn.master.slot.assign.maxWorkers` workers,
which defaults to 10000, so I believe the cascading stage rerun will
definitely happen.
I think setting `celeborn.master.slot.assign.maxWorkers` to a smaller
value can mitigate this, at the cost of spreading each shuffle's load
over fewer workers.
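For example, in `celeborn-defaults.conf` on the master side (the value 20 is
only illustrative; tune it against load balance):

```
# Cap how many workers a single shuffle's slots can be allocated on,
# so that one worker failure touches fewer shuffles.
celeborn.master.slot.assign.maxWorkers = 20
```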
I have a question about how Celeborn distributes shuffle data among its
workers.
From our observation, it seems that whenever a Celeborn worker fails or
gets killed (in a small cluster of fewer than 25 nodes), almost every edge
is affected. Does this mean that an edge with multiple partitions has its
shuffle data spread across all Celeborn workers?
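For a rough sense of the numbers involved (the figures below are assumed,
just to illustrate the effect we are seeing): if an edge's partitions are
placed uniformly at random across all workers, any single worker almost
surely holds some partition of every edge.

```scala
val workers = 25      // cluster size from the observation above
val partitions = 500  // hypothetical partition count for one edge
// Probability that a specific worker holds none of this edge's
// partitions, assuming independent uniform placement.
val pUntouched = math.pow(1.0 - 1.0 / workers, partitions)
println(f"P(worker holds no partition of this edge) = $pUntouched%.2e")
// ~ 1.4e-9, i.e. killing any one worker affects essentially every edge.
```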