Question on Celeborn workers,

2023-10-12 Thread Sungwoo Park
I have a question about how Celeborn distributes shuffle data among Celeborn workers. From our observation, it seems that whenever a Celeborn worker fails or gets killed (in a small cluster of fewer than 25 nodes), almost every edge is affected. Does this mean that an edge with multiple partitions

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-12 Thread Erik fang
Hi Mridul, sorry for the late reply. Per my understanding, the key point about Spark shuffleId and StageId/StageAttemptId is that shuffleId is assigned at ShuffleDependency creation time and bound to the RDD/ShuffleDependency, while StageId/StageAttemptId is assigned and changes at Job execution tim