[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848749#comment-16848749 ]

Atul Anand commented on SPARK-13182:
------------------------------------

[~srowen] The issue here is that Spark does not count these as failures, and so 
it keeps retrying.

I have hit infinite retries in a valid scenario; see 
[here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically, YARN preempted the Spark containers because they were running in a 
lower-priority queue.

But Spark restarted the containers right away, and YARN killed them again.

Spark should have hit the max-failures count after a few kills, but it does 
not count these kills as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence Spark keeps relaunching containers, while YARN keeps killing them.
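For reference, a couple of documented Spark-on-YARN settings can at least bound this loop, although neither makes Spark count preemption-driven exits as task failures. The sketch below is a workaround under those assumptions, not a fix for the reported behavior; the queue name and the failure limit of 10 are made up for illustration:

```shell
# Sketch of a workaround, assuming Spark on YARN.
# spark.yarn.max.executor.failures aborts the whole application once that
# many executors have failed, so preempted containers cannot be relaunched
# forever. spark.task.maxFailures (default 4) is the per-task retry limit,
# which, as the log above shows, preemption-related exits do NOT count
# toward. Queue name "low-priority" is hypothetical.
spark-submit \
  --class SimpleApp \
  --master yarn \
  --queue low-priority \
  --conf spark.yarn.max.executor.failures=10 \
  --conf spark.task.maxFailures=4 \
  /SPARK/SimpleApp.jar
```

This only caps the damage: the application dies after the limit instead of looping, but Spark still does not treat the preemptions themselves as task failures.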

> Spark Executor retries infinitely
> ---------------------------------
>
>                 Key: SPARK-13182
>                 URL: https://issues.apache.org/jira/browse/SPARK-13182
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>            Priority: Minor
>
>   When a Spark job (Spark 1.5.2) is submitted with a single executor and the 
> user passes wrong JVM arguments via spark.executor.extraJavaOptions, the 
> first executor fails. But the job keeps retrying, creating a new executor 
> and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here the user submits the job with ConcGCThreads=16, which is greater than 
> ParallelGCThreads, so the JVM crashes at startup. But the job does not 
> exit; it keeps creating executors and retrying.
> ..........
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on this kind of user error on a 
> production cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
