[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228937#comment-14228937 ]
Josh Rosen commented on SPARK-4498:
-----------------------------------

Hi [~airhorns], I finally got a chance to look into this and, based on reading the code, I have a theory about what might be happening. If you look at the [current Master.scala file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668], you'll notice that there are only two situations where the standalone Master removes applications:

- The Master receives a DisassociatedEvent because the application actor shut down, and calls {{finishApplication}}.
- An executor exited with a non-zero exit status and the maximum number of executor failures has been exceeded.

Now, imagine that for some reason the standalone Master does not receive a DisassociatedEvent. When executors eventually start to die, the Master will discover this via ExecutorStateChanged. If it hasn't hit the maximum number of executor failures, [it will attempt to re-schedule the application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325] and obtain new resources. If a new executor is granted, this will [cause the "maximum failed executors count" to reset to zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313], leading to a sort of livelock: executors keep dying because they can't contact the application, but keep being launched because each replacement enters the ExecutorState.RUNNING state ([it looks like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148] executors transition to this state when they launch, not once they've registered with the driver).

The line

{code}
if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
{code}

was introduced in SPARK-2425, which landed after the earliest commit that you mentioned, so this looks like a regression in 1.2.0. I don't think that we should revert SPARK-2425, since it fixes another fairly important bug. Instead, I'd like to figure out how an application could fail without a DisassociatedEvent causing it to be removed. Could this be due to our use of non-standard Akka timeout / failure detector settings? I would think that we'd still get a DisassociatedEvent when a network connection is closed. Maybe we could switch to relying on our own explicit heartbeats for failure detection, like we do elsewhere in Spark. [~markhamstra], do you have any ideas here?
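To make the retry-count interaction above concrete, here is a minimal, self-contained Scala simulation of the bookkeeping described in the comment. Every name in it (maxNumRetry, retryCount, the two handlers) is an illustrative stand-in, not the actual Master code; it only shows that if a RUNNING transition arrives for each replacement executor before that executor fails, the failure count is reset every round and never reaches the removal threshold:

{code}
// Hypothetical simulation of the reset-vs-increment race described above (names are placeholders).
object RetryCountLivelockSketch {
  val maxNumRetry = 10   // stand-in for the Master's maximum executor-failure count
  var retryCount  = 0    // stand-in for the per-application retry counter

  // Called when the Master sees ExecutorStateChanged(RUNNING) for a freshly launched executor.
  def onExecutorRunning(): Unit = {
    retryCount = 0       // analogous to appInfo.resetRetryCount()
  }

  // Called when the Master sees a non-zero executor exit; true means the app would be removed.
  def onExecutorFailed(): Boolean = {
    retryCount += 1      // analogous to appInfo.incrementRetryCount()
    retryCount >= maxNumRetry
  }

  def main(args: Array[String]): Unit = {
    var removed = false
    for (attempt <- 1 to 100 if !removed) {
      onExecutorRunning()                   // executor reports RUNNING at launch time...
      removed = onExecutorFailed()          // ...then times out on the dead driver and exits non-zero
    }
    println(s"application removed within 100 attempts? $removed")  // prints: false (livelock)
  }
}
{code}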
> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4498
>                 URL: https://issues.apache.org/jira/browse/SPARK-4498
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.2.0
>         Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
> - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
> - Python 2.7.3
> - java version "1.7.0_71"
>   Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
>   Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece
> - All client code is PySpark
>            Reporter: Harry Brundage
>            Priority: Critical
>         Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt
>
>
> We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application: it boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted.
> This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this "blip", applications start, get resources, finish, and are marked as finished properly. The "blip" is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers.
> Deploy mode:
> - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised
> Other factoids:
> - In most places, we call sc.stop() explicitly before shutting down our driver process
> - Here's the sum total of Spark configuration options we don't set to the default:
> {code}
> "spark.cores.max": 30
> "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
> "spark.eventLog.enabled": true
> "spark.executor.memory": "7g"
> "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
> "spark.io.compression.codec": "lzf"
> "spark.ui.killEnabled": true
> {code}
> - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>         ... 7 more
> {code}
> Cluster history:
> - We run Spark versions built from apache/spark#master snapshots. We did not observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry it's so old), but now observe it on {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to assist debugging.
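The "Futures timed out after [30 seconds]" failure in the quoted stack trace comes from the executor blocking on a reply from a driver that has already exited. A rough, self-contained Scala sketch of that failure mode follows; it is illustrative only (the real call lives in CoarseGrainedExecutorBackend.run and fetches the driver's Spark properties over Akka, and the names below are placeholders):

{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the executor waiting on a driver that is already gone.
object DriverPropsTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // A promise that is never completed plays the role of a reply from the dead driver.
    val reply = Promise[Seq[(String, String)]]()
    try {
      // Blocks for the full timeout, then throws TimeoutException,
      // matching the "Futures timed out after [30 seconds]" message in the log above.
      Await.result(reply.future, 30.seconds)
    } catch {
      case e: TimeoutException =>
        println(s"giving up on the driver: $e")
        sys.exit(1) // the non-zero exit is what the Master later sees via ExecutorStateChanged
    }
  }
}
{code}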