I have a question on how Celeborn distributes shuffle data among Celeborn
workers.
From our observation, it seems that whenever a Celeborn worker fails or
gets killed (in a small cluster of fewer than 25 nodes), almost every edge
is affected. Does this mean that an edge with multiple partitions
Hi Mridul,
Sorry for the late reply.
As I understand it, the key difference between Spark's shuffleId and
StageId/StageAttemptId is this:
shuffleId is assigned when the ShuffleDependency is created and is bound to the
RDD/ShuffleDependency, while StageId/StageAttemptId is assigned, and changes,
at Job execution time