I have a question on how Celeborn distributes shuffle data among Celeborn
workers.
From our observation, it seems that whenever a Celeborn worker fails or
gets killed (in a small cluster of less than 25 nodes), almost every edge
is affected. Does this mean that an edge with multiple partitions usually
distributes its shuffle data among all Celeborn workers?
If this is the case, I think fine-grained stage recomputation is
unnecessary, and simply re-executing the entire DAG is a better approach.
Our current implementation uses the following scheme for stage
recomputation:
1. If a read failure occurs for shuffleId #1 for an edge, we pick up a new
shuffleId #2 for the same edge.
2. The upstream stage re-executes all tasks, but writes the output to
shuffleId #2.
3. Tasks in the downstream stage re-try by reading from shuffleId #2.
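To make the scheme concrete, here is a minimal sketch in Python (my own
illustration, not MR3 or Celeborn code; the class and method names are
hypothetical) of the per-edge shuffleId bookkeeping the three steps imply:

```python
import itertools

class ShuffleRegistry:
    """Tracks the current shuffleId for each edge in the DAG."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._current = {}  # edge -> current shuffleId

    def register(self, edge):
        # Assign the initial shuffleId for an edge (e.g. shuffleId #1).
        self._current[edge] = next(self._next_id)
        return self._current[edge]

    def current(self, edge):
        return self._current[edge]

    def on_read_failure(self, edge):
        # Step 1: pick a new shuffleId for the failed edge.
        # Step 2: the upstream stage re-executes, writing to the new id.
        # Step 3: downstream tasks retry, reading from the new id.
        self._current[edge] = next(self._next_id)
        return self._current[edge]

registry = ShuffleRegistry()
edge = ("map_stage", "reduce_stage")
first = registry.register(edge)          # shuffleId #1
second = registry.on_read_failure(edge)  # shuffleId #2
assert first == 1 and second == 2
```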
From our experiment, whenever a Celeborn worker fails and a read failure
occurs for an edge, the re-execution of the upstream stage usually ends up
with another read failure because some part of its input has also been
lost. As a result, all upstream stages are eventually re-executed in a
cascading manner. In essence, the failure of a Celeborn worker invalidates
all existing shuffleIds.
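The cascade can be illustrated with a toy model (again my own sketch, not
Celeborn code; worker and shuffle names are made up). If every shuffle
spreads its partitions over all workers, then one worker failure touches
every shuffle, and recomputation walks back to the DAG sources:

```python
workers = {"w1", "w2", "w3"}

# Linear DAG: stage0 -> shuffle A -> stage1 -> shuffle B -> stage2.
# Assume each shuffle stores a partition on every worker.
shuffle_locations = {"A": set(workers), "B": set(workers)}
# The input shuffle of each shuffle's writer stage (None for a source stage).
upstream_shuffle = {"B": "A", "A": None}

failed = {"w2"}  # one Celeborn worker dies

def is_lost(shuffle):
    # A shuffle is lost if any worker holding its data has failed.
    return bool(shuffle_locations[shuffle] & failed)

# stage2 hits a read failure on B; rewriting B means re-running stage1,
# whose own input A is also lost -> the recomputation cascades upward.
to_recompute = []
s = "B"
while s is not None and is_lost(s):
    to_recompute.append(s)
    s = upstream_shuffle[s]

assert to_recompute == ["B", "A"]  # every existing shuffleId is invalidated
```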
(This is what we observed with Hive-MR3-Celeborn, but I guess stage
recomputation in Spark will have to deal with the same problem.)
Thanks,
--- Sungwoo