[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

2019-05-28 Thread Atul Anand (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849742#comment-16849742 ]

Atul Anand commented on SPARK-13182:


# YARN's policy is to preempt a job in a low-priority queue in favour of a job in a higher-priority queue. It is doing exactly that, so IMHO nothing is wrong with the YARN policy.
# YARN users (like Spark and MapReduce) decide what to do after a preemption. If Spark keeps relaunching containers infinitely, the preemption is not actually handled.
# This behaviour makes the YARN job queue passed via "spark.yarn.queue" irrelevant.

[~mccheah]'s [commit|https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-bad3987c83bd22d46416d3dd9d208e76R730] made the optimisation to ignore non-application failures.

IMHO we should have an additional counter to limit retries due to non-application errors, something like externalFailuresRetries = Inf by default.

Other people, who expect external failures to be preemptions only, can set it to 1 or 2.
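The proposed counter could be sketched roughly as below. This is a hypothetical illustration only, not Spark's actual code: the names ExecutorRetryTracker, external_failures_retries and record_exit are invented for the sketch, and Spark's real failure accounting lives in its scheduler internals.

```python
import math

class ExecutorRetryTracker:
    """Hypothetical sketch of a bounded retry counter that distinguishes
    application failures from external ones (e.g. YARN preemption)."""

    def __init__(self, max_app_failures=4, external_failures_retries=math.inf):
        # external_failures_retries = Inf preserves today's behaviour;
        # operators who expect only preemptions could set it to 1 or 2.
        self.max_app_failures = max_app_failures
        self.external_failures_retries = external_failures_retries
        self.app_failures = 0
        self.external_failures = 0

    def record_exit(self, caused_by_app):
        """Record an executor exit; return True if a relaunch is allowed."""
        if caused_by_app:
            self.app_failures += 1
            return self.app_failures <= self.max_app_failures
        self.external_failures += 1
        return self.external_failures <= self.external_failures_retries

tracker = ExecutorRetryTracker(external_failures_retries=2)
print(tracker.record_exit(caused_by_app=False))  # True: first external failure
print(tracker.record_exit(caused_by_app=False))  # True: second, still allowed
print(tracker.record_exit(caused_by_app=False))  # False: limit exhausted
```

With the infinite default, nothing changes for existing users; only clusters that opt in get the bound.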

> Spark Executor retries infinitely
> -
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Prabhu Joseph
>Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and the 
> user passes wrong JVM arguments via spark.executor.extraJavaOptions, the 
> first executor fails, but the job keeps retrying, creating a new executor 
> and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here the user submits the job with ConcGCThreads=16, which is greater than 
> ParallelGCThreads, so the JVM crashes. But the job does not exit; it keeps 
> creating executors and retrying.
> ..
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on this kind of user error on a 
> production cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Sean Owen (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848904#comment-16848904 ]

Sean Owen commented on SPARK-13182:
---

On its face that sounds like a YARN-related issue. Rescheduling a preempted 
task is correct, but not if there are not enough resources available to 
execute it. If resources became available but not for long enough to finish 
the task, that is still an app-level and YARN policy issue.




[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Atul Anand (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848749#comment-16848749 ]

Atul Anand commented on SPARK-13182:


[~srowen] The issue here is that Spark does not consider these as failures, and 
so keeps retrying.

I have hit the infinite retry in a valid scenario, see 
[here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically, YARN preempted the Spark containers because they were running in a 
lower-priority queue.

But Spark restarted the containers right away, and YARN killed them again.

Spark should have hit the max-failures count after a few kills, but it does not 
count these as failures:
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence Spark keeps relaunching containers while YARN keeps killing them.
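The decision behind the log line above can be sketched as follows. This is a simplified illustration, not Spark's real implementation (which lives in TaskSetManager); should_abort, MAX_TASK_FAILURES and the dict-based counting are invented here, though 4 is the documented default of spark.task.maxFailures.

```python
MAX_TASK_FAILURES = 4  # default of spark.task.maxFailures

def should_abort(task_failure_counts, task_id, exit_caused_by_app):
    """Return True if the stage should be aborted after this task failure."""
    if not exit_caused_by_app:
        # The executor exited for a reason unrelated to the task (e.g. YARN
        # preemption): the failure is not counted toward the maximum, so
        # the task can be retried without bound.
        return False
    task_failure_counts[task_id] = task_failure_counts.get(task_id, 0) + 1
    return task_failure_counts[task_id] >= MAX_TASK_FAILURES

counts = {}
# One hundred preemption-driven exits never push the task toward an abort:
aborts = [should_abort(counts, 95, exit_caused_by_app=False) for _ in range(100)]
print(any(aborts))  # False
```

Because the external branch never increments the counter, the relaunch/kill cycle described above can continue indefinitely.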
