[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201 ]

Attila Zsolt Piros commented on SPARK-33763:
--------------------------------------------

I am not sure about the "stage re-submitted because of fetch failure" solution either, as "stages.failedStages.count" is already available and most failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different loss reasons) have finished, I will reopen it as a non-WIP PR (or remove the WIP label).

> Add metrics for better tracking of dynamic allocation
> -----------------------------------------------------
>
>                 Key: SPARK-33763
>                 URL: https://issues.apache.org/jira/browse/SPARK-33763
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Holden Karau
>            Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
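The comment above mentions counter metrics for the different executor loss reasons. As a rough illustration only, the idea of one counter per loss reason might be sketched as follows. This is plain Python, not Spark code; the class names and reason labels are hypothetical, loosely derived from items 1 and 4 of the issue description, and do not reflect the metric names used in the actual PR.

```python
from collections import Counter
from enum import Enum


class LossReason(Enum):
    # Hypothetical labels based on items 1 and 4 of the issue description.
    GRACEFUL_DECOMMISSION = "gracefulDecommission"
    DA_SCHEDULED_REMOVAL = "dynamicAllocationScheduledRemoval"
    UNEXPECTED = "unexpected"


class ExecutorRemovalMetrics:
    """Keeps one monotonically increasing counter per loss reason."""

    def __init__(self):
        self._counts = Counter()

    def record(self, reason: LossReason) -> None:
        # Increment the counter that belongs to this loss reason.
        self._counts[reason] += 1

    def count(self, reason: LossReason) -> int:
        # Reasons never recorded report zero.
        return self._counts[reason]


metrics = ExecutorRemovalMetrics()
metrics.record(LossReason.GRACEFUL_DECOMMISSION)
metrics.record(LossReason.UNEXPECTED)
metrics.record(LossReason.UNEXPECTED)
```

Keeping the reasons separate lets an operator tell expected removals (decommissions, dynamic allocation scale-down) apart from unexpected ones instead of reading a single aggregate number.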
[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277694#comment-17277694 ]

Apache Spark commented on SPARK-33763:
--------------------------------------

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/31450
[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277675#comment-17277675 ]

Attila Zsolt Piros commented on SPARK-33763:
--------------------------------------------

I am ready with the executor removals (items 1 and 4 in the issue description), but I would like to discuss stages resubmitted (I think you probably meant stages and not jobs) and fetch failures.

I thought about these two missing metrics (and checked the code too) and I have a suggestion: let's combine the two into a single metric, stage resubmitted because of fetch failure.

Justification: the number of fetch failures depends heavily on the cluster size (and, even worse, on the data). When one executor goes down, every other executor fetching from it reports a fetch failure. So the raw count is not that helpful, as it depends on how many reducers read from that single mapper.

[~holden] what do you think?
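The justification in the comment above can be illustrated with a small, self-contained model. This is a simplified sketch in plain Python, not Spark code: it assumes every reducer that reads from the lost executor reports exactly one fetch failure, and that the affected map stage is resubmitted once no matter how many failures arrive. The function and metric names are hypothetical.

```python
from collections import Counter


def events_for_lost_executor(num_reducers, map_stage_id=0):
    """Model one lost executor: each reducer reports one fetch failure
    against the same map stage (simplifying assumption)."""
    return [(reducer_id, map_stage_id) for reducer_id in range(num_reducers)]


def count_metrics(fetch_failure_events):
    """Count raw fetch failures and deduplicated stage resubmissions.

    fetch_failure_events: iterable of (reducer_id, map_stage_id) pairs.
    """
    metrics = Counter()
    resubmitted_stages = set()
    for _reducer_id, map_stage_id in fetch_failure_events:
        # Every report increments the raw fetch failure count.
        metrics["fetch_failures"] += 1
        # A stage is counted as resubmitted only on its first failure report.
        if map_stage_id not in resubmitted_stages:
            resubmitted_stages.add(map_stage_id)
            metrics["stages_resubmitted_fetch_failure"] += 1
    return metrics


small = count_metrics(events_for_lost_executor(num_reducers=10))
large = count_metrics(events_for_lost_executor(num_reducers=1000))
print(small["fetch_failures"], small["stages_resubmitted_fetch_failure"])  # 10 1
print(large["fetch_failures"], large["stages_resubmitted_fetch_failure"])  # 1000 1
```

Under these assumptions the same single executor loss yields 10 or 1000 fetch failures depending only on the number of reducers, while the resubmission count stays at 1 in both cases.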
[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273923#comment-17273923 ]

Attila Zsolt Piros commented on SPARK-33763:
--------------------------------------------

I started working on this.