[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218491#comment-14218491 ]
Harry Brundage commented on SPARK-4498: --------------------------------------- For the simple canary spark application (who's master logs are attached), the first executor does exactly what it is supposed to, and then the driver shuts down and it disassociates. Then, a second executor is started, when it never should have been. The first executors stderr: {code} 14/11/19 18:48:16 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 18:48:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/19 18:48:16 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 18:48:16 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 18:48:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 18:48:17 INFO Slf4jLogger: Slf4jLogger started 14/11/19 18:48:17 INFO Remoting: Starting remoting 14/11/19 18:48:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:36365] 14/11/19 18:48:17 INFO Utils: Successfully started service 'driverPropsFetcher' on port 36365. 14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/11/19 18:48:18 INFO Slf4jLogger: Slf4jLogger started 14/11/19 18:48:18 INFO Remoting: Starting remoting 14/11/19 18:48:18 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974] 14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 14/11/19 18:48:18 INFO Utils: Successfully started service 'sparkExecutor' on port 39974. 14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/CoarseGrainedScheduler 14/11/19 18:48:18 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker 14/11/19 18:48:18 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker 14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 18:48:18 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/MapOutputTracker 14/11/19 18:48:18 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/BlockManagerMaster 14/11/19 18:48:18 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141119184818-e5ae 14/11/19 18:48:18 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 14/11/19 18:48:18 INFO LogReporter: Creating metrics output file: /tmp/spark-metrics 14/11/19 18:48:18 INFO NettyBlockTransferService: Server created on 44406 14/11/19 18:48:18 INFO BlockManagerMaster: Trying to register BlockManager 14/11/19 18:48:18 INFO BlockManagerMaster: Registered BlockManager 14/11/19 18:48:18 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/HeartbeatReceiver 14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/11/19 18:48:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/11/19 18:48:18 INFO Executor: Fetching hdfs://nn01.chi.shopify.com:8020/tmp/starscream_pyfiles_cache/packages-68badf293ff9e4d13929073572892d7f3d0a3546.egg with timestamp 1416422896241 14/11/19 18:48:19 INFO TorrentBroadcast: Started reading broadcast variable 1 14/11/19 18:48:19 INFO MemoryStore: ensureFreeSpace(3391) called with curMem=0, maxMem=278302556 14/11/19 18:48:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.3 KB, free 265.4 MB) 14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 14/11/19 18:48:19 INFO BlockManager: Got told to re-register updating block broadcast_1_piece0 14/11/19 18:48:19 INFO BlockManager: BlockManager re-registering with master 14/11/19 18:48:19 INFO BlockManagerMaster: Trying to register BlockManager 14/11/19 18:48:19 INFO TorrentBroadcast: Reading broadcast variable 1 took 218 ms 14/11/19 18:48:19 INFO BlockManagerMaster: Registered BlockManager 14/11/19 18:48:19 INFO BlockManager: Reporting 1 blocks to the master. 14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(5288) called with curMem=3391, maxMem=278302556 14/11/19 18:48:20 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.2 KB, free 265.4 MB) 14/11/19 18:48:20 INFO HadoopRDD: Input split: hdfs://nn01.chi.shopify.com:8020/data/static/canary/part-00000:0+6 14/11/19 18:48:20 INFO TorrentBroadcast: Started reading broadcast variable 0 14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(10705) called with curMem=8679, maxMem=278302556 14/11/19 18:48:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 10.5 KB, free 265.4 MB) 14/11/19 18:48:20 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 14/11/19 18:48:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 17 ms 14/11/19 18:48:20 WARN Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS 14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(186596) called with curMem=19384, maxMem=278302556 14/11/19 18:48:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 182.2 KB, free 265.2 MB) 14/11/19 18:48:20 INFO PythonRDD: Times: total = 531, boot = 229, init = 302, finish = 0 14/11/19 18:48:20 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1847 bytes result sent to driver 14/11/19 18:48:22 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974] -> [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849] disassociated! Shutting down. {code} and the second executor's stderr: {code} 14/11/19 18:48:24 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 18:48:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/19 18:48:24 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 18:48:24 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 18:48:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 18:48:25 INFO Slf4jLogger: Slf4jLogger started 14/11/19 18:48:25 INFO Remoting: Starting remoting 14/11/19 18:48:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:56495] 14/11/19 18:48:25 INFO Utils: Successfully started service 'driverPropsFetcher' on port 56495. 14/11/19 18:48:25 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 18:48:55 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) ... 4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) ... 7 more {code} > Standalone Master can fail to recognize completed/failed applications > --------------------------------------------------------------------- > > Key: SPARK-4498 > URL: https://issues.apache.org/jira/browse/SPARK-4498 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.2.0 > Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic > #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux > - Standalone Spark built from > apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 > - Python 2.7.3 > java version "1.7.0_71" > Java(TM) SE Runtime Environment (build 1.7.0_71-b14) > Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) > - 1 Spark master, 40 Spark workers with 32 cores a piece and 60-90 GB of > memory a piece > - All client code is PySpark > Reporter: Harry Brundage > Attachments: all-master-logs-around-blip.txt, > one-applications-master-logs.txt > > > We observe the spark standalone master not detecting that a driver > application has completed after the driver process has shut down > indefinitely, leaving that driver's resources consumed indefinitely. The > master reports applications as Running, but the driver process has long since > terminated. The master continually spawns one executor for the application. > It boots, times out trying to connect to the driver application, and then > dies with the exception below. The master then spawns another executor on a > different worker, which does the same thing. The application lives until the > master (and workers) are restarted. > This happens to many jobs at once, all right around the same time, two or > three times a day, where they all get suck. Before and after this "blip" > applications start, get resources, finish, and are marked as finished > properly. The "blip" is mostly conjecture on my part, I have no hard evidence > that it exists other than my identification of the pattern in the Running > Applications table. See > http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png > : the applications started before the blip at 1.9 hours ago still have > active drivers. All the applications started 1.9 hours ago do not, and the > applications started less than 1.9 hours ago (at the top of the table) do in > fact have active drivers. > Deploy mode: > - PySpark drivers running on one node outside the cluster, scheduled by a > cron-like application, not master supervised > > Other factoids: > - In most places, we call sc.stop() explicitly before shutting down our > driver process > - Here's the sum total of spark configuration options we don't set to the > default: > {code} > "spark.cores.max": 30 > "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs" > "spark.eventLog.enabled": true > "spark.executor.memory": "7g" > "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/" > "spark.io.compression.codec": "lzf" > "spark.ui.killEnabled": true > {code} > - The exception the executors die with is this: > {code} > 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal > handlers for [TERM, HUP, INT] > 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban > 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban > 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); > users with modify permissions: Set(spark, azkaban) > 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started > 14/11/19 19:42:37 INFO Remoting: Starting remoting > 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] > 14/11/19 19:42:38 INFO Utils: Successfully started service > 'driverPropsFetcher' on port 37682. > 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote > address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is > now gated for 5000 ms, all messages to this address will be delivered to dead > letters. Reason: Connection refused: > spark-etl1.chi.shopify.com/172.16.126.88:58849 > 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException > as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures > timed out after [30 seconds] > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: > Unknown exception in doAs > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > Caused by: java.security.PrivilegedActionException: > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) > ... 4 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > ... 7 more > {code} > Cluster history: > - We run spark versions built from apache/spark#master snapshots. We did not > observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry > its so old), but now observe it on > {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to > assist debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org