[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495200#comment-16495200 ]
Thomas Graves commented on SPARK-24415:
---------------------------------------

This might actually be an ordering-of-events issue. Note that the config I have sets spark.blacklist.stage.maxFailedTasksPerExecutor=1, so the stage should really only record one failed task, but looking at the log the scheduler starts the second task before it has fully handled the blacklisting from the first failure:

18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 bytes)
[Stage 2:> (0 + 1) / 10]
18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None)
18/05/30 13:57:21 INFO Client: Application report for application_1526529576371_25524 (state: RUNNING)
18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB)
18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 bytes)
18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad executor
....
18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2
18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2
18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled
18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at <console>:26) failed in 12.063 s due to Job aborted due to stage failure:
18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at <console>:26, took 12.069052 s

The thing is, though, that the Executors page shows 2 task failures on that node; it's only the aggregated metrics for that stage that are missing one.

> Stage page aggregated executor metrics wrong when failures
> -----------------------------------------------------------
>
>                 Key: SPARK-24415
>                 URL: https://issues.apache.org/jira/browse/SPARK-24415
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Priority: Major
>         Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should show 2 failed tasks but it only shows one. Note I tested with the master branch to verify it's not fixed there either.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv
> sc.parallelize(1 to 10000, 10).map { x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()
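For anyone who wants to confirm the discrepancy without clicking through the UI, below is a minimal sketch (not part of the report above) that pulls the same two numbers out of the v1 monitoring REST API: /allexecutors backs the Executors page, and the per-stage executorSummary backs the "Aggregated Metrics by Executor" table. It assumes the driver UI is on the default http://localhost:4040 and that the failed stage is stage 2; the object and method names here are made up for illustration.

import scala.io.Source

// Hypothetical helper; can be pasted into the same spark-shell session.
// Usage: FailedTaskCheck.compare(sc.applicationId, 2)
object FailedTaskCheck {
  // Pulls every "failedTasks" field out of the REST API's JSON responses.
  private val FailedTasks = """"failedTasks"\s*:\s*(\d+)""".r

  private def fetch(url: String): String = {
    val src = Source.fromURL(url)
    try src.mkString finally src.close()
  }

  def compare(appId: String, stageId: Int,
              ui: String = "http://localhost:4040"): Unit = {
    val base = s"$ui/api/v1/applications/$appId"
    // Per-executor totals behind the Executors page (2 failures here).
    val execJson = fetch(s"$base/allexecutors")
    // Per-stage executorSummary behind "Aggregated Metrics by Executor"
    // (the table this bug says comes out wrong).
    val stageJson = fetch(s"$base/stages/$stageId")
    println("Executors page failedTasks: " +
      FailedTasks.findAllMatchIn(execJson).map(_.group(1)).mkString(", "))
    println(s"Stage $stageId executorSummary failedTasks: " +
      FailedTasks.findAllMatchIn(stageJson).map(_.group(1)).mkString(", "))
  }
}

If the race described above is the cause, the first line should include a 2 for the blacklisted executor while the second shows only a 1 for the same executor.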