[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849742#comment-16849742 ]

Atul Anand commented on SPARK-13182:
------------------------------------

# YARN's policy is to preempt a job in a low-priority queue in favour of a job 
in a higher-priority queue. It is doing exactly that, so IMHO there is nothing 
wrong with the YARN policy.
# YARN applications (like Spark and MapReduce) decide what to do after a 
preemption, whatever its cause. If Spark keeps relaunching containers 
indefinitely, the preemption is not actually handled.
# This behaviour makes the YARN queue passed via "spark.yarn.queue" irrelevant.

[~mccheah]'s 
[commit|https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-bad3987c83bd22d46416d3dd9d208e76R730] 
made an optimisation to ignore non-application failures.
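The idea behind that change, as a minimal sketch (the helper name is 
hypothetical; ContainerExitStatus is the actual Hadoop YARN API):

{code:scala}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus

// Hypothetical helper: decide whether an executor exit should count
// towards spark.yarn.max.executor.failures.
def countsTowardsFailureLimit(exitStatus: Int): Boolean = exitStatus match {
  // Preemption is the scheduler's decision, not an application error.
  case ContainerExitStatus.PREEMPTED => false
  // Anything else (e.g. a JVM that dies at startup) is a real failure.
  case _ => true
}
{code}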

IMHO we should have an additional counter to limit retries due to 
non-application errors, something like externalFailuresRetries = Inf by 
default.

People who expect external failures to be preemptions only can then set it to 
1 or 2.
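A minimal sketch of what such a setting could look like, in the style of 
Spark's internal ConfigBuilder entries (the key name and default are 
assumptions to illustrate the proposal, not an existing Spark config):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

object ExternalFailuresConfig {
  // Hypothetical entry; "spark.yarn.externalFailuresRetries" does not
  // exist in Spark and is named here only to illustrate the proposal.
  val EXTERNAL_FAILURES_RETRIES =
    ConfigBuilder("spark.yarn.externalFailuresRetries")
      .doc("Maximum number of executor retries caused by external " +
        "(non-application) failures such as preemption.")
      .intConf
      .createWithDefault(Int.MaxValue) // effectively "Inf"
}
{code}

A job that treats repeated preemption as fatal could then pass, e.g., 
--conf spark.yarn.externalFailuresRetries=2 at submit time.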

> Spark Executor retries infinitely
> ---------------------------------
>
>                 Key: SPARK-13182
>                 URL: https://issues.apache.org/jira/browse/SPARK-13182
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>            Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and if 
> user passes some wrong JVM arguments with spark.executor.extraJavaOptions, 
> the first executor fails. But the job keeps on retrying, creating a new 
> executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here when user submits job with ConcGCThreads 16 which is greater than 
> ParallelGCThreads, JVM will crash. But the job does not exit, keeps on 
> creating executors and retrying.
> ..........
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on these kind of user errors on a 
> production cluster.


