[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228937#comment-14228937 ]
Josh Rosen commented on SPARK-4498:
-----------------------------------

Hi [~airhorns], I finally got a chance to look into this and, based on reading the code, I have a theory about what might be happening. If you look at the [current Master.scala file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668], you'll notice that there are only two situations where the standalone Master removes applications:

- The Master receives a DisassociatedEvent because the application actor shut down, and calls {{finishApplication}}.
- An executor exited with a non-zero exit status and the maximum number of executor failures has been exceeded.

Now, imagine that for some reason the standalone Master does not receive a DisassociatedEvent. When executors eventually start to die, the Master will discover this via ExecutorStateChanged. If it hasn't hit the maximum number of executor failures, [it will attempt to re-schedule the application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325] and obtain new resources. If a new executor is granted, this will [cause the "maximum failed executors count" to reset to zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313], leading to a sort of livelock: executors keep dying because they can't contact the application, but keep being launched because each replacement enters the ExecutorState.RUNNING state ([it looks like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148] executors transition to this state when they launch, not once they've registered with the driver).

The line

{code}
if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
{code}

was introduced in SPARK-2425, which landed after the earliest commit that you mentioned, so this looks like a regression in 1.2.0. I don't think that we should revert SPARK-2425, since it fixes another fairly important bug. Instead, I'd like to figure out how an application could fail without a DisassociatedEvent causing it to be removed. Could this be due to our use of non-standard Akka timeout / failure detector settings? I would think that we'd still get a DisassociatedEvent when a network connection is closed. Maybe we could switch to relying on our own explicit heartbeats for failure detection, like we do elsewhere in Spark. [~markhamstra], do you have any ideas here?
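To make the retry-count interaction above concrete, here is a minimal, self-contained Scala simulation of the bookkeeping described in the comment. Every name in it (maxNumRetry, retryCount, the two handlers) is an illustrative stand-in, not the actual Master code; it only shows that if a RUNNING transition arrives for each replacement executor before that executor fails, the failure count is reset every round and never reaches the removal threshold:

{code}
// Hypothetical simulation of the reset-vs-increment race described above (names are placeholders).
object RetryCountLivelockSketch {
  val maxNumRetry = 10   // stand-in for the Master's maximum executor-failure count
  var retryCount  = 0    // stand-in for the per-application retry counter

  // Called when the Master sees ExecutorStateChanged(RUNNING) for a freshly launched executor.
  def onExecutorRunning(): Unit = {
    retryCount = 0       // analogous to appInfo.resetRetryCount()
  }

  // Called when the Master sees a non-zero executor exit; true means the app would be removed.
  def onExecutorFailed(): Boolean = {
    retryCount += 1      // analogous to appInfo.incrementRetryCount()
    retryCount >= maxNumRetry
  }

  def main(args: Array[String]): Unit = {
    var removed = false
    for (attempt <- 1 to 100 if !removed) {
      onExecutorRunning()                   // executor reports RUNNING at launch time...
      removed = onExecutorFailed()          // ...then times out on the dead driver and exits non-zero
    }
    println(s"application removed within 100 attempts? $removed")  // prints: false (livelock)
  }
}
{code}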
> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4498
>                 URL: https://issues.apache.org/jira/browse/SPARK-4498
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.2.0
>         Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
> - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
> - Python 2.7.3
> - java version "1.7.0_71"
>   Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
>   Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece
> - All client code is PySpark
>            Reporter: Harry Brundage
>            Priority: Critical
>         Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt
>
>
> We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application: it boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted.
> This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this "blip", applications start, get resources, finish, and are marked as finished properly. The "blip" is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers.
> Deploy mode:
> - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised
> Other factoids:
> - In most places, we call sc.stop() explicitly before shutting down our driver process
> - Here's the sum total of Spark configuration options we don't set to the default:
> {code}
> "spark.cores.max": 30
> "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
> "spark.eventLog.enabled": true
> "spark.executor.memory": "7g"
> "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
> "spark.io.compression.codec": "lzf"
> "spark.ui.killEnabled": true
> {code}
> - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>         ... 7 more
> {code}
> Cluster history:
> - We run Spark versions built from apache/spark#master snapshots. We did not observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry it's so old), but now observe it on {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to assist debugging.
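The "Futures timed out after [30 seconds]" failure in the quoted stack trace comes from the executor blocking on a reply from a driver that has already exited. A rough, self-contained Scala sketch of that failure mode follows; it is illustrative only (the real call lives in CoarseGrainedExecutorBackend.run and fetches the driver's Spark properties over Akka, and the names below are placeholders):

{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the executor waiting on a driver that is already gone.
object DriverPropsTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // A promise that is never completed plays the role of a reply from the dead driver.
    val reply = Promise[Seq[(String, String)]]()
    try {
      // Blocks for the full timeout, then throws TimeoutException,
      // matching the "Futures timed out after [30 seconds]" message in the log above.
      Await.result(reply.future, 30.seconds)
    } catch {
      case e: TimeoutException =>
        println(s"giving up on the driver: $e")
        sys.exit(1) // the non-zero exit is what the Master later sees via ExecutorStateChanged
    }
  }
}
{code}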