[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495200#comment-16495200 ]
Thomas Graves commented on SPARK-24415:
---------------------------------------

This might actually be an ordering-of-events issue. Note that the config I have sets spark.blacklist.stage.maxFailedTasksPerExecutor=1, so the stage should really only record one failed task, but looking at the log the scheduler starts the second task before it has fully handled the blacklisting from the first failure:

18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 bytes)
[Stage 2:> (0 + 1) / 10]
18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None)
18/05/30 13:57:21 INFO Client: Application report for application_1526529576371_25524 (state: RUNNING)
18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB)
18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 bytes)
18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad executor
....
18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2
18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2
18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled
18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at <console>:26) failed in 12.063 s due to Job aborted due to stage failure:
18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at <console>:26, took 12.069052 s

The thing is, though, that the Executors page shows 2 task failures on that node; it's only the aggregated metrics for that stage that are missing one.

> Stage page aggregated executor metrics wrong when failures
> -----------------------------------------------------------
>
>                 Key: SPARK-24415
>                 URL: https://issues.apache.org/jira/browse/SPARK-24415
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Priority: Major
>         Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should show 2 failed tasks but it only shows one. Note I tested with the master branch to verify it's not fixed there either.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv
> sc.parallelize(1 to 10000, 10).map { x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()
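For anyone who wants to confirm the discrepancy without clicking through the UI, below is a minimal sketch (not part of the report above) that pulls the same two numbers out of the v1 monitoring REST API: /allexecutors backs the Executors page, and the per-stage executorSummary backs the "Aggregated Metrics by Executor" table. It assumes the driver UI is on the default http://localhost:4040 and that the failed stage is stage 2; the object and method names here are made up for illustration.

import scala.io.Source

// Hypothetical helper; can be pasted into the same spark-shell session.
// Usage: FailedTaskCheck.compare(sc.applicationId, 2)
object FailedTaskCheck {
  // Pulls every "failedTasks" field out of the REST API's JSON responses.
  private val FailedTasks = """"failedTasks"\s*:\s*(\d+)""".r

  private def fetch(url: String): String = {
    val src = Source.fromURL(url)
    try src.mkString finally src.close()
  }

  def compare(appId: String, stageId: Int,
              ui: String = "http://localhost:4040"): Unit = {
    val base = s"$ui/api/v1/applications/$appId"
    // Per-executor totals behind the Executors page (2 failures here).
    val execJson = fetch(s"$base/allexecutors")
    // Per-stage executorSummary behind "Aggregated Metrics by Executor"
    // (the table this bug says comes out wrong).
    val stageJson = fetch(s"$base/stages/$stageId")
    println("Executors page failedTasks: " +
      FailedTasks.findAllMatchIn(execJson).map(_.group(1)).mkString(", "))
    println(s"Stage $stageId executorSummary failedTasks: " +
      FailedTasks.findAllMatchIn(stageJson).map(_.group(1)).mkString(", "))
  }
}

If the race described above is the cause, the first line should include a 2 for the blacklisted executor while the second shows only a 1 for the same executor.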