t oo created SPARK-33085:
----------------------------

             Summary: "Master removed our application" error leads to FAILED 
driver status instead of KILLED driver status
                 Key: SPARK-33085
                 URL: https://issues.apache.org/jira/browse/SPARK-33085
             Project: Spark
          Issue Type: Bug
          Components: Scheduler, Spark Core
    Affects Versions: 2.4.6
            Reporter: t oo


 

driver-20200930160855-0316 exited with status FAILED

 

I am using the Spark Standalone scheduler with spot EC2 workers. I confirmed that the myip.87 EC2 instance was terminated at 2020-09-30 16:16.

 

*I would expect the overall driver status to be KILLED, but instead it was FAILED.* My goal is to interpret a FAILED status as "do not rerun, a non-transient error was hit" and a KILLED/ERROR status as "do rerun, a transient error was hit" (see the sketch after the logs). But it looks like the FAILED status is being set even in the transient-error case below:

  

Below are the driver logs:
{code:java}
2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,421 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/7 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,423 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/7 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,424 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/8 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/8 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/9 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
2020-09-30 16:16:40,427 [dispatcher-event-loop-14] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/9 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
2020-09-30 16:16:40,429 [dispatcher-event-loop-5] ERROR org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Application has been killed. Reason: Master removed our application: FAILED
2020-09-30 16:16:40,438 [main] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 564822f2-f2fd-42cd-8d57-b6d5dff145f6.
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
	at com.yotpo.metorikku.output.writers.file.FileOutputWriter.save(FileOutputWriter.scala:134)
	at com.yotpo.metorikku.output.writers.file.FileOutputWriter.write(FileOutputWriter.scala:65)
	at com.yotpo.metorikku.metric.Metric.com$yotpo$metorikku$metric$Metric$$writeBatch(Metric.scala:97)
	at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:136)
	at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:125)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at com.yotpo.metorikku.metric.Metric.write(Metric.scala:125)
	at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:44)
	at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:39)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at com.yotpo.metorikku.metric.MetricSet.run(MetricSet.scala:39)
	at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:17)
	at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:15)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at com.yotpo.metorikku.Metorikku$.runMetrics(Metorikku.scala:15)
	at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:11)
	at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
	at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
	at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
2020-09-30 16:16:40,457 [stop-spark-context] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
2020-09-30 16:16:40,461 [stop-spark-context] ERROR org.apache.spark.util.Utils - Uncaught exception in thread stop-spark-context
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:283)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:227)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:124)
	at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:669)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2044)
	at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
	at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1903)
Caused by: org.apache.spark.SparkException: Could not find AppClient.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
	at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
	at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
	at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63)
	... 9 more
{code}
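
For context, the rerun decision I make on top of the final driver status looks roughly like the sketch below. This is my own illustration, not Spark code; the way the final driver state string is obtained (e.g. by parsing {{spark-submit --status <driver-id>}} output) is assumed and not shown.

{code:scala}
// Minimal sketch of the rerun decision described above.
// Assumption: finalDriverState holds the final driver status string
// reported by the standalone Master (KILLED, ERROR, FAILED, FINISHED, ...).
object ResubmitDecision {
  // KILLED/ERROR are treated as transient (e.g. spot instance loss) -> rerun.
  // FAILED is treated as a non-transient application error -> do not rerun.
  def shouldResubmit(finalDriverState: String): Boolean = finalDriverState match {
    case "KILLED" | "ERROR" => true
    case "FAILED"           => false
    case _                  => false // FINISHED, RUNNING, etc.: nothing to rerun
  }

  def main(args: Array[String]): Unit = {
    // With the behaviour reported in this ticket, a spot-instance termination
    // surfaces as FAILED, so the rerun is skipped even though the error was transient.
    println(shouldResubmit("FAILED")) // false
    println(shouldResubmit("KILLED")) // true
  }
}
{code}

This is why the FAILED vs KILLED distinction matters to me: the transient worker loss above lands in the FAILED branch and the job is never retried.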


