[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590748#comment-16590748 ]

Apache Spark commented on SPARK-24415:
--------------------------------------

User 'ankuriitg' has created a pull request for this issue:
https://github.com/apache/spark/pull/22209

> Stage page aggregated executor metrics wrong when failures
> ----------------------------------------------------------
>
>                 Key: SPARK-24415
>                 URL: https://issues.apache.org/jira/browse/SPARK-24415
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Priority: Critical
>         Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the
> aggregated metrics by executor are not correct. In my example it should have
> 2 failed tasks but it only shows one. Note I tested with the master branch to
> verify it's not fixed.
> I will attach a screen shot.
> To reproduce:
>
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client \
>   --executor-memory=2G --num-executors=1 \
>   --conf "spark.blacklist.enabled=true" \
>   --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" \
>   --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" \
>   --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" \
>   --conf "spark.blacklist.killBlacklistedExecutors=true"
>
> import org.apache.spark.SparkEnv
> sc.parallelize(1 to 1, 10).map { x =>
>   if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4)
>     throw new RuntimeException("Bad executor")
>   else (x % 3, x)
> }.reduceByKey((a, b) => a + b).collect()

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511618#comment-16511618 ]

Ankur Gupta commented on SPARK-24415:
-------------------------------------

I am planning to work on this JIRA.
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495224#comment-16495224 ]

Thomas Graves commented on SPARK-24415:
---------------------------------------

OK, so the issue here is in the AppStatusListener, where it only updates the
task metrics for liveStages. It gets the second taskEnd event after it has
cancelled stage 2, so the stage is no longer in the live stages array.
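The liveStages problem described above can be modeled in a few lines. This is a hedged sketch, not Spark's actual AppStatusListener code: the class, map, and method names below (StageKey, LiveStageSketch, onTaskEndBuggy, onTaskEndFixed) are hypothetical, chosen only to illustrate how a taskEnd event that arrives after the stage has been removed from a live-stages map is silently dropped, and how falling back to the store holding completed/cancelled stages would preserve the count:

```scala
import scala.collection.mutable

object LiveStageSketch {
  case class StageKey(stageId: Int, attempt: Int)

  final class StageMetrics { var failedTasks = 0 }

  // Stages currently running; a cancelled stage is removed from here.
  val liveStages = mutable.Map[StageKey, StageMetrics]()
  // Where finished/cancelled stages end up (the UI reads from a store like this).
  val completedStages = mutable.Map[StageKey, StageMetrics]()

  // Buggy behavior: only stages still in liveStages get their metrics updated.
  def onTaskEndBuggy(key: StageKey): Unit =
    liveStages.get(key).foreach(_.failedTasks += 1)

  // Fix direction: fall back to the completed store when the stage has
  // already been removed from liveStages.
  def onTaskEndFixed(key: StageKey): Unit =
    liveStages.get(key).orElse(completedStages.get(key)).foreach(_.failedTasks += 1)

  // Replays the event order from the bug report and returns the final
  // failed-task count the UI would display for the stage.
  def run(handler: StageKey => Unit): Int = {
    liveStages.clear(); completedStages.clear()
    val key = StageKey(2, 0)
    liveStages(key) = new StageMetrics                 // stage 2 submitted
    handler(key)                                       // taskEnd: task 0.0 failed
    completedStages(key) = liveStages.remove(key).get  // stage 2 cancelled
    handler(key)                                       // taskEnd for task 1.0 arrives late
    completedStages(key).failedTasks
  }

  def main(args: Array[String]): Unit = {
    println(s"buggy: ${run(onTaskEndBuggy)} failed tasks")  // 1, not 2
    println(s"fixed: ${run(onTaskEndFixed)} failed tasks")  // 2
  }
}
```

With the buggy handler the second failure is dropped and the stage page shows 1 failed task; with the fallback both failures are counted, which matches the 2 failures the executors page reports.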
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495204#comment-16495204 ]

Thomas Graves commented on SPARK-24415:
---------------------------------------

It also looks like in the history server they show up properly in the
aggregated metrics, although if you look at the "Tasks (for all stages)"
column on the jobs page, it only lists a single task failure where it should
list 2.
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495200#comment-16495200 ]

Thomas Graves commented on SPARK-24415:
---------------------------------------

This might actually be an order-of-events type of thing. You will note that
the config I have is stage.maxFailedTasksPerExecutor=1, so it should really
only have 1 failed task, but looking at the log it seems it starts the second
task before completely handling the blacklist from the first failure:

18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 bytes)
[Stage 2:> (0 + 1) / 10]18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None)
18/05/30 13:57:21 INFO Client: Application report for application_1526529576371_25524 (state: RUNNING)
18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB)
18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 bytes)
18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad executor
18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2
18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2
18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled
18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at <console>:26) failed in 12.063 s due to Job aborted due to stage failure:
18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at <console>:26, took 12.069052 s

The thing is, though, that the executors page shows that it had 2 task
failures on that node; it's just the aggregated metrics for that stage that
don't have it.
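The interleaving visible in the log above can be sketched as a simple model: the scheduler offers task 1.0 on executor 2 before the failure of task 0.0 has been processed and the executor blacklisted, so two tasks run (and fail) even though stage.maxFailedTasksPerExecutor=1. This is a hypothetical model with made-up names (BlacklistRaceSketch, offerTask, handleFailure), not Spark's actual TaskSetManager/TaskSetBlacklist code:

```scala
import scala.collection.mutable

object BlacklistRaceSketch {
  // Executors blacklisted for the current stage.
  val stageBlacklist = mutable.Set[String]()
  var tasksStarted = 0
  var tasksFailed = 0

  // The scheduler only refuses an offer once the executor is blacklisted.
  def offerTask(executorId: String): Boolean =
    if (stageBlacklist.contains(executorId)) false
    else { tasksStarted += 1; true }

  // Handling a failure is what updates the blacklist, and it happens
  // after the failure event is processed, not when the task dies.
  def handleFailure(executorId: String, maxFailedTasksPerExecutor: Int): Unit = {
    tasksFailed += 1
    if (tasksFailed >= maxFailedTasksPerExecutor) stageBlacklist += executorId
  }

  // Replay the event order from the log: both offers land before either
  // failure is handled, so executor 2 runs and fails two tasks.
  def run(): (Int, Int) = {
    stageBlacklist.clear(); tasksStarted = 0; tasksFailed = 0
    offerTask("2")        // 13:57:20 Starting task 0.0 (TID 2)
    offerTask("2")        // 13:57:21 Starting task 1.0 (TID 3)
    handleFailure("2", 1) // 13:57:21 Lost task 0.0 -> blacklist executor 2
    handleFailure("2", 1) // task 1.0 fails too; already blacklisted
    (tasksStarted, tasksFailed)
  }

  def main(args: Array[String]): Unit = {
    val (started, failed) = run()
    println(s"tasksStarted=$started tasksFailed=$failed") // 2 and 2
  }
}
```

This is consistent with the executors page reporting 2 task failures on the node: both failures really happened; only the per-stage aggregation loses the second one.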