[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-08-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590748#comment-16590748
 ] 

Apache Spark commented on SPARK-24415:
--

User 'ankuriitg' has created a pull request for this issue:
https://github.com/apache/spark/pull/22209

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should show 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed there.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv
> sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-06-13 Thread Ankur Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511618#comment-16511618
 ] 

Ankur Gupta commented on SPARK-24415:
-

I am planning to work on this JIRA







[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495224#comment-16495224
 ] 

Thomas Graves commented on SPARK-24415:
---

OK, so the issue here is in the AppStatusListener: it only updates the task 
metrics for liveStages. It receives the second taskEnd event after it has 
cancelled stage 2, so the stage is no longer in the live stages array.
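The drop described above can be sketched as a toy (this is an illustration of the pattern, not Spark's actual AppStatusListener code; all names here are invented):

```scala
import scala.collection.mutable

object LiveStagesSketch {
  // Stages currently tracked as "live".
  val liveStages = mutable.Set[Int]()
  // Persisted per-(stage, executor) failed-task counts, as a UI would show.
  val failedTasks = mutable.Map[(Int, String), Int]().withDefaultValue(0)

  def stageSubmitted(id: Int): Unit = liveStages += id
  def stageCancelled(id: Int): Unit = liveStages -= id

  // Bug pattern: the count is only updated while the stage is still live,
  // so a task-failed event delivered after cancellation is silently lost.
  def onTaskFailed(stageId: Int, execId: String): Unit =
    if (liveStages.contains(stageId))
      failedTasks((stageId, execId)) += 1

  def main(args: Array[String]): Unit = {
    stageSubmitted(2)
    onTaskFailed(2, "2")    // first failure: counted
    stageCancelled(2)       // stage 2 cancelled due to blacklisting
    onTaskFailed(2, "2")    // second failure arrives late: dropped
    println(failedTasks((2, "2")))  // prints 1, not the expected 2
  }
}
```

Any fix along these lines has to either keep cancelled stages around long enough to absorb straggling taskEnd events, or update the persisted counts regardless of liveness.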







[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495204#comment-16495204
 ] 

Thomas Graves commented on SPARK-24415:
---

It also looks like in the history server the failures show up properly in the 
aggregated metrics, although if you look at the Tasks (for all stages) column 
on the jobs page, it only lists a single task failure where it should list 2.







[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495200#comment-16495200
 ] 

Thomas Graves commented on SPARK-24415:
---

This might actually be an ordering-of-events issue. Note that the config I 
have sets spark.blacklist.stage.maxFailedTasksPerExecutor=1, so there should 
really only be 1 failed task, but looking at the log it starts the second task 
before it has fully handled the blacklisting from the first failure:

 

18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 
bytes)
[Stage 2:> (0 + 1) / 10]18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: 
Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB 
RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None)
18/05/30 13:57:21 INFO Client: Application report for 
application_1526529576371_25524 (state: RUNNING)
18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB)
18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, 
gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 
bytes)
18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad 
executor



18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2

18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2
18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled
18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at <console>:26) 
failed in 12.063 s due to Job aborted due to stage failure:

18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at <console>:26, 
took 12.069052 s

 

The thing is, though, that the executors page shows 2 task failures on that 
node; it's only the aggregated metrics for that stage that are missing one.



