Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Sungwoo Park
a) If one or more tasks for a stage (and so its shuffle id) are going to be recomputed and it is an INDETERMINATE stage, all shuffle output will be discarded and the stage will be entirely recomputed (see here

Question on Celeborn workers,

2023-10-12 Thread Sungwoo Park
I have a question on how Celeborn distributes shuffle data among Celeborn workers. From our observation, it seems that whenever a Celeborn worker fails or gets killed (in a small cluster of fewer than 25 nodes), almost every edge is affected. Does this mean that an edge with multiple

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-08 Thread Sungwoo Park
the cleanup of metadata is delayed and not guaranteed. It will be more complicated when we consider graceful restart of workers. If we want to reuse the shuffleId, we need to redesign the whole picture. Thanks, Keyong Zhou. On Mon, Oct 2, 2023 at 13:23, Sungwoo Park wrote: Hi Keyong, Instead of picking up a new

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-01 Thread Sungwoo Park
will LifecycleManager announce data lost. Thanks, Keyong Zhou. On Fri, Sep 29, 2023 at 22:05, Sungwoo Park wrote: Since the partition split has a good chance to contain data from almost all upstream mapper tasks, the cost of re-computing all upstream tasks may differ little from that of re-computing the actual

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-09-29 Thread Sungwoo Park
Since the partition split has a good chance to contain data from almost all upstream mapper tasks, the cost of re-computing all upstream tasks may differ little from that of re-computing the actual mapper tasks in most cases. Of course it's not always true. To change from 'complete' to