Copilot commented on code in PR #3640:
URL: https://github.com/apache/celeborn/pull/3640#discussion_r3020483187
##########
client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -459,6 +459,15 @@ public static boolean shouldReportShuffleFetchFailure(long taskId) {
int stageId = taskSetManager.stageId();
int stageAttemptId = taskSetManager.taskSet().stageAttemptId();
int maxTaskFails = taskSetManager.maxTaskFailures();
+ if (taskSetManager.isZombie()) {
+ LOG.warn(
+ "StageId={} stageAttemptId={} taskId={}: TaskSetManager is zombie, skip reporting "
+ + "shuffle fetch failure to avoid invalidating active shuffle data.",
+ stageId,
+ stageAttemptId,
+ taskId);
+ return false;
Review Comment:
`shouldReportShuffleFetchFailure` is invoked via a per-task precheck; if a
stage attempt becomes zombie while many tasks are still running, this `warn`
will likely be emitted once per task and can flood driver logs. Consider
downgrading to INFO/DEBUG and/or logging once per (stageId, stageAttemptId)
(e.g., track a per-stage flag) to keep log volume under control while still
preserving debuggability.
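One possible shape for the "log once per (stageId, stageAttemptId)" option is sketched below. This is only an illustration under stated assumptions: `ZombieLogGate` and `firstSighting` are hypothetical names that do not exist in Celeborn, and the sketch never evicts entries, so a real change would probably need to clear them when a stage completes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Celeborn): gates a log line to once per stage attempt. */
final class ZombieLogGate {
  // Each key packs (stageId, stageAttemptId) into one long.
  // Assumption: entries are never removed in this sketch.
  private static final Set<Long> SEEN = ConcurrentHashMap.newKeySet();

  private ZombieLogGate() {}

  /** Returns true only for the first caller with a given (stageId, stageAttemptId). */
  static boolean firstSighting(int stageId, int stageAttemptId) {
    // High 32 bits hold stageId, low 32 bits hold stageAttemptId (masked to avoid sign extension).
    long key = ((long) stageId << 32) | (stageAttemptId & 0xFFFFFFFFL);
    // add() on a concurrent key set is atomic and returns false if the key was already present.
    return SEEN.add(key);
  }
}
```

In the zombie branch, the existing `warn` could then be emitted only when `ZombieLogGate.firstSighting(stageId, stageAttemptId)` returns true, with later calls falling back to `debug` so per-task detail is still available when needed.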
##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -323,6 +323,15 @@ public static boolean shouldReportShuffleFetchFailure(long taskId) {
int stageId = taskSetManager.stageId();
int stageAttemptId = taskSetManager.taskSet().stageAttemptId();
int maxTaskFails = taskSetManager.maxTaskFailures();
+ if (taskSetManager.isZombie()) {
+ logger.warn(
+ "StageId={} stageAttemptId={} taskId={}: TaskSetManager is zombie, skip reporting "
+ + "shuffle fetch failure to avoid invalidating active shuffle data.",
+ stageId,
+ stageAttemptId,
+ taskId);
+ return false;
Review Comment:
`shouldReportShuffleFetchFailure` runs as a precheck for each fetch-failure
report; when a stage attempt is already zombie, many tasks can still report
failures and this `warn` can become very noisy. Recommend lowering to
INFO/DEBUG and/or emitting the message once per (stageId, stageAttemptId) to
avoid log flooding in large stages.