Ngone51 commented on code in PR #36162: URL: https://github.com/apache/spark/pull/36162#discussion_r885645567
##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -853,8 +857,11 @@ private[spark] class TaskSchedulerImpl(
     // (taskId, stageId, stageAttemptId, accumUpdates)
     val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = {
      accumUpdates.flatMap { case (id, updates) =>
-        val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None))
         Option(taskIdToTaskSetManager.get(id)).map { taskSetMgr =>
+          val (accInfos, taskProgressRate) = getTaskAccumulableInfosAndProgressRate(updates)

Review Comment:
   I'm a bit worried about the scheduler's throughput if the concerns about the cost of traversing the accumulators are valid. I still think we could traverse only inside the speculation thread, to decouple this work from the scheduling thread. Moving it to the speculation thread would also avoid unnecessary traversals, since they are only needed when `checkSpeculatableTasks` requires them, whereas the current implementation traverses on every heartbeat update and every successful task completion.

   If we move it to the speculation thread, the implementation could also be a bit simpler. In `TaskSchedulerImpl.executorHeartbeatReceived()`, we should only set `_accumulables`. And we don't even need to set `_accumulables` ourselves, since that's already covered by `DAGScheduler.updateAccumulators()`. Then we'd only need to focus on the calculation/traversals in `InefficientTaskCalculator`. The first-time traversal might be a bit slow, but we can cache the records/runtime for finished tasks and the progress rate for running tasks. And even if it's slow, I think it's still better than slowing down the scheduling threads.

   @weixiuli @mridulm WDYT?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
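The caching idea suggested for the speculation thread could be sketched roughly as follows. This is a minimal, self-contained illustration: `ProgressRateCache` and `TaskSnapshot` are hypothetical names invented here, not Spark APIs, and a real `InefficientTaskCalculator` would instead read the task's `_accumulables` populated by `DAGScheduler.updateAccumulators()`.

```scala
import scala.collection.mutable

// Hypothetical snapshot of one task's metrics, as the speculation thread might see them.
final case class TaskSnapshot(
    taskId: Long,
    recordsRead: Long,
    runtimeMs: Long,
    finished: Boolean)

// Hypothetical cache: finished tasks never change, so their progress rate is
// computed once and reused on every subsequent checkSpeculatableTasks() pass;
// only running tasks are recomputed.
final class ProgressRateCache {
  private val finishedRates = mutable.Map.empty[Long, Double]

  def progressRate(t: TaskSnapshot): Double = {
    def compute: Double =
      if (t.runtimeMs > 0) t.recordsRead.toDouble / t.runtimeMs else 0.0
    if (t.finished) finishedRates.getOrElseUpdate(t.taskId, compute)
    else compute
  }
}

object ProgressRateDemo extends App {
  val cache = new ProgressRateCache
  val running  = TaskSnapshot(1L, recordsRead = 500L, runtimeMs = 250L, finished = false)
  val finished = TaskSnapshot(2L, recordsRead = 900L, runtimeMs = 300L, finished = true)
  println(cache.progressRate(running))   // recomputed on every speculation check
  println(cache.progressRate(finished))  // computed once, then served from the cache
}
```

With something like this, only the first speculation-thread pass pays the full traversal cost, which matches the trade-off described above: a slower first check in exchange for keeping the scheduling thread untouched.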
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org