[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

Harry Brundage (JIRA) Wed, 19 Nov 2014 12:58:06 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218491#comment-14218491
 ]


Harry Brundage commented on SPARK-4498:
---------------------------------------

For the simple canary Spark application (whose master logs are attached), the 
first executor does exactly what it is supposed to, and then the driver shuts 
down and the executor disassociates. Then a second executor is started, when it 
never should have been. The first executor's stderr:

{code}
14/11/19 18:48:16 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/11/19 18:48:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/19 18:48:16 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:17 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:17 INFO Remoting: Starting remoting
14/11/19 18:48:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:36365]
14/11/19 18:48:17 INFO Utils: Successfully started service 'driverPropsFetcher' on port 36365.
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/11/19 18:48:18 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:18 INFO Remoting: Starting remoting
14/11/19 18:48:18 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974]
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
14/11/19 18:48:18 INFO Utils: Successfully started service 'sparkExecutor' on port 39974.
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/CoarseGrainedScheduler
14/11/19 18:48:18 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/MapOutputTracker
14/11/19 18:48:18 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/BlockManagerMaster
14/11/19 18:48:18 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141119184818-e5ae
14/11/19 18:48:18 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/11/19 18:48:18 INFO LogReporter: Creating metrics output file: /tmp/spark-metrics
14/11/19 18:48:18 INFO NettyBlockTransferService: Server created on 44406
14/11/19 18:48:18 INFO BlockManagerMaster: Trying to register BlockManager
14/11/19 18:48:18 INFO BlockManagerMaster: Registered BlockManager
14/11/19 18:48:18 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/HeartbeatReceiver
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/11/19 18:48:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/19 18:48:18 INFO Executor: Fetching hdfs://nn01.chi.shopify.com:8020/tmp/starscream_pyfiles_cache/packages-68badf293ff9e4d13929073572892d7f3d0a3546.egg with timestamp 1416422896241
14/11/19 18:48:19 INFO TorrentBroadcast: Started reading broadcast variable 1
14/11/19 18:48:19 INFO MemoryStore: ensureFreeSpace(3391) called with curMem=0, maxMem=278302556
14/11/19 18:48:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.3 KB, free 265.4 MB)
14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/11/19 18:48:19 INFO BlockManager: Got told to re-register updating block broadcast_1_piece0
14/11/19 18:48:19 INFO BlockManager: BlockManager re-registering with master
14/11/19 18:48:19 INFO BlockManagerMaster: Trying to register BlockManager
14/11/19 18:48:19 INFO TorrentBroadcast: Reading broadcast variable 1 took 218 ms
14/11/19 18:48:19 INFO BlockManagerMaster: Registered BlockManager
14/11/19 18:48:19 INFO BlockManager: Reporting 1 blocks to the master.
14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(5288) called with curMem=3391, maxMem=278302556
14/11/19 18:48:20 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.2 KB, free 265.4 MB)
14/11/19 18:48:20 INFO HadoopRDD: Input split: hdfs://nn01.chi.shopify.com:8020/data/static/canary/part-00000:0+6
14/11/19 18:48:20 INFO TorrentBroadcast: Started reading broadcast variable 0
14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(10705) called with curMem=8679, maxMem=278302556
14/11/19 18:48:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 10.5 KB, free 265.4 MB)
14/11/19 18:48:20 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/11/19 18:48:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 17 ms
14/11/19 18:48:20 WARN Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
14/11/19 18:48:20 INFO MemoryStore: ensureFreeSpace(186596) called with curMem=19384, maxMem=278302556
14/11/19 18:48:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 182.2 KB, free 265.2 MB)
14/11/19 18:48:20 INFO PythonRDD: Times: total = 531, boot = 229, init = 302, finish = 0
14/11/19 18:48:20 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1847 bytes result sent to driver
14/11/19 18:48:22 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974] -> [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849] disassociated! Shutting down.
{code}

and the second executor's stderr:

{code}
14/11/19 18:48:24 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/11/19 18:48:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/19 18:48:24 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:24 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:25 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:25 INFO Remoting: Starting remoting
14/11/19 18:48:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:56495]
14/11/19 18:48:25 INFO Utils: Successfully started service 'driverPropsFetcher' on port 56495.
14/11/19 18:48:25 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
14/11/19 18:48:55 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
        ... 7 more
{code}
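
The two stderr files can be told apart mechanically. Here is a minimal sketch in Python, assuming only the marker strings that appear in the excerpts above. The function name classify_executor_stderr and the category labels are hypothetical and not part of Spark. Whitespace is collapsed before matching so that line wrapping in the log text does not hide a marker.

```python
# Marker strings copied from the two executor stderr excerpts in this comment.
DISASSOCIATED = "Driver Disassociated"
REGISTERED = "Successfully registered with driver"
UNREACHABLE = "Tried to associate with unreachable remote address"
TIMED_OUT = "Futures timed out after"


def classify_executor_stderr(text):
    """Return a rough (hypothetical) label for one executor's stderr."""
    # Collapse all whitespace so markers split across wrapped lines still match.
    flat = " ".join(text.split())
    if DISASSOCIATED in flat:
        # Executor was connected and shut down when the driver went away.
        return "driver-exited-after-registration"
    if UNREACHABLE in flat and TIMED_OUT in flat:
        # Executor never reached the driver: the pattern of the second log.
        return "driver-unreachable-at-startup"
    if REGISTERED in flat:
        return "registered-still-running"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python classify.py path/to/stderr [more paths...]
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, classify_executor_stderr(handle.read()))
```

Run over every executor stderr for one application, a long run of driver-unreachable-at-startup results after a single driver-exited-after-registration would match the respawn pattern described in this issue.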

> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4498
>                 URL: https://issues.apache.org/jira/browse/SPARK-4498
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>         Environment:  - Linux dn11.chi.shopify.com 3.2.0-57-generic 
> #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
>  - Standalone Spark built from 
> apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
>  - Python 2.7.3
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>  - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of 
> memory apiece
>  - All client code is PySpark
>            Reporter: Harry Brundage
>         Attachments: all-master-logs-around-blip.txt, 
> one-applications-master-logs.txt
>
>
> We observe the Spark standalone master failing to detect that a driver 
> application has completed after the driver process has shut down, leaving 
> that driver's resources consumed indefinitely. The master reports 
> applications as Running, but the driver process has long since terminated. 
> The master continually spawns one executor for the application. The executor 
> boots, times out trying to connect to the driver application, and then dies 
> with the exception below. The master then spawns another executor on a 
> different worker, which does the same thing. The application lives until the 
> master (and workers) are restarted. 
> This happens to many jobs at once, all right around the same time, two or 
> three times a day, and they all get stuck. Before and after this "blip", 
> applications start, get resources, finish, and are marked as finished 
> properly. The "blip" is mostly conjecture on my part; I have no hard evidence 
> that it exists other than my identification of the pattern in the Running 
> Applications table. See 
> http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
>  : the applications started before the blip at 1.9 hours ago still have 
> active drivers. All the applications started 1.9 hours ago do not, and the 
> applications started less than 1.9 hours ago (at the top of the table) do in 
> fact have active drivers.
> Deploy mode:
>  - PySpark drivers running on one node outside the cluster, scheduled by a 
> cron-like application, not master supervised
>  
> Other factoids:
>  - In most places, we call sc.stop() explicitly before shutting down our 
> driver process
>  - Here's the sum total of spark configuration options we don't set to the 
> default:
> {code}
>     "spark.cores.max": 30
>     "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
>     "spark.eventLog.enabled": true
>     "spark.executor.memory": "7g"
>     "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
>     "spark.io.compression.codec": "lzf"
>     "spark.ui.killEnabled": true
> {code}
>  - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
>       at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>       at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>       at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>       at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>       ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>       at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>       at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>       at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>       at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>       at scala.concurrent.Await$.result(package.scala:107)
>       at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
>       at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>       at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>       ... 7 more
> {code}
> Cluster history:
>  - We run Spark versions built from apache/spark#master snapshots. We did not 
> observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry 
> it's so old), but now observe it on 
> {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to 
> assist debugging.
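
The description says sc.stop() is called explicitly in most places, but not all. One way to rule out a missed call while debugging is to wrap every driver in a context manager that always stops the context. Below is a minimal sketch in Python; the names stopping and FakeContext are hypothetical, and FakeContext is only a stand-in so the sketch runs without a cluster. Whether a missed stop() contributes to this issue is not established.

```python
from contextlib import contextmanager


@contextmanager
def stopping(context):
    """Yield `context` and call its stop() on exit, even if the body raises."""
    try:
        yield context
    finally:
        context.stop()


class FakeContext(object):
    """Stand-in for a SparkContext; records whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


if __name__ == "__main__":
    # With PySpark this would be: with stopping(SparkContext(conf=conf)) as sc:
    ctx = FakeContext()
    try:
        with stopping(ctx):
            raise RuntimeError("job failed")
    except RuntimeError:
        pass
    print("stopped:", ctx.stopped)
```

A hand-written helper is used because contextlib.closing calls close() rather than stop().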



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
