[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201 ]
Attila Zsolt Piros commented on SPARK-33763:
--------------------------------------------

I am not positive about the "stage re-submitted because of fetch failure" solution either, as "stages.failedStages.count" is already available and most failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different loss reasons) have finished, I will reopen it as a non-WIP PR (or remove the WIP label).

> Add metrics for better tracking of dynamic allocation
> -----------------------------------------------------
>
>                 Key: SPARK-33763
>                 URL: https://issues.apache.org/jira/browse/SPARK-33763
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Holden Karau
>            Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.
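
For readers following the discussion, below is a minimal sketch of what per-loss-reason counters could look like using the Dropwizard metrics library that Spark ships for its metrics system. The object and counter names here are hypothetical and chosen only to mirror the list in the issue description; they are not taken from the PR mentioned above.

{code:scala}
import com.codahale.metrics.{Counter, MetricRegistry}

// Hypothetical sketch: one Dropwizard counter per executor-loss reason.
// Names are illustrative, not the metric names from the actual patch.
object ExecutorLossReasonMetrics {
  val registry = new MetricRegistry

  // 1- graceful decommissions & dynamic-allocation scheduled deletes
  val gracefullyDecommissioned: Counter =
    registry.counter(MetricRegistry.name("executor", "gracefullyDecommissioned"))

  // 4- unexpected (non-Spark triggered) executor removals
  val exitedUnexpectedly: Counter =
    registry.counter(MetricRegistry.name("executor", "exitedUnexpectedly"))

  // executors removed by the driver itself (e.g. dynamic allocation idle timeout)
  val killedByDriver: Counter =
    registry.counter(MetricRegistry.name("executor", "killedByDriver"))
}

// Usage: increment the matching counter wherever the loss reason is handled,
// e.g. ExecutorLossReasonMetrics.exitedUnexpectedly.inc()
{code}

In Spark itself, such counters would presumably be exposed through an internal metrics Source (as the existing dynamic allocation metrics are) rather than a standalone object; the registry above is only for illustration.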