GitHub user RussellSpitzer opened a pull request:
https://github.com/apache/spark/pull/20298
[SPARK-22976][Core]: Cluster mode driver dir removed while running
## What changes were proposed in this pull request?
The cleanup logic on the worker previously determined the liveness of a
particular application based only on whether it had running executors.
This failed for a directory created for a driver running in cluster mode
when that driver had no running executors on the same machine: the
directory was treated as dead and removed. To preserve driver directories,
we now consider both executors and running drivers when checking directory
liveness.
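
The liveness check described above can be sketched as follows. This is only an illustrative model in Python, not the worker code from this patch: the function name `stale_dirs`, its parameters, and the directory names are hypothetical, and `has_recent_files` stands in for whatever retention check the worker applies.

```python
def stale_dirs(work_dirs, executor_app_ids, running_driver_ids, has_recent_files):
    """Return the work directories that are considered safe to delete.

    Hypothetical sketch: a directory is live if its name matches either an
    application with a running executor or a running driver on this worker.
    """
    # Union of both ID sets; before the change only executor app IDs counted.
    live_ids = set(executor_app_ids) | set(running_driver_ids)
    return [
        d for d in work_dirs
        if d not in live_ids and not has_recent_files(d)
    ]


# Assumed scenario: the driver runs here, its executors run on another node.
work_dirs = ["driver-A", "app-old"]
no_recent_files = lambda d: False  # pretend every directory is past its TTL

# Drivers ignored (old behaviour): the running driver's directory is removed.
old = stale_dirs(work_dirs, set(), set(), no_recent_files)

# Drivers included (new behaviour): only the finished app's directory is removed.
new = stale_dirs(work_dirs, set(), {"driver-A"}, no_recent_files)
```

With drivers ignored, `driver-A` appears in the removal list even though the driver is still running; once running driver IDs are included, only `app-old` is returned.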
## How was this patch tested?
Manually started a two-node cluster with a single core on each node.
Turned on worker directory cleanup and set both the cleanup interval and
the liveness threshold to one second. Without the patch, the driver
directory is removed shortly after the app is launched, while the driver is
still running. With the patch, it is not.
### Without Patch
```
INFO 2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver
driver-20180105234824-
INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to:
cassandra
INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to:
cassandra
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups
to:
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls
groups to:
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(cassandra); groups with view permissions: Set(); users with modify
permissions: Set(cassandra); groups with modify permissions: Set()
INFO 2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar
file:/home/automaton/writeRead-0.1.jar to
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO 2018-01-05 23:48:25,332 Logging.scala:54 - Copying
/home/automaton/writeRead-0.1.jar to
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO 2018-01-05 23:48:25,361 Logging.scala:54 - Launch Command:
"/usr/lib/jvm/jdk1.8.0_40//bin/java"
INFO 2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory:
/var/lib/spark/worker/driver-20180105234824- ### << Cleaned up
--
One minute passes while app runs (app has 1 minute sleep built in)
--
WARN 2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to
unregister application app-20180105234831- when it is not registered
INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 -
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 -
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO 2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 -
Application app-20180105234831- removed, cleanupLocalDirs = true
INFO 2018-01-05 23:50:00,999 Logging.scala:54 - Driver
driver-20180105234824- exited successfully
```
### With Patch
```
INFO 2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver
driver-20180108231954-0002
INFO 2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to:
automaton
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to:
automaton
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups
to:
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls
groups to:
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(automaton); groups with view permissions: Set(); users with modify
permissions: Set(automaton); groups with modify permissions: Set()
INFO 2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar
file:/home/automaton/writeRead-0.1.jar to
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO 2018-01-08 23:19:55,031 Logging.scala:54 - Copying
/home/automaton/writeRead-0.1.jar to
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO 2018-01-08 23:19:55,038 Logging.scala:54 - Launch Command: ..
INFO 2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered
shuffle secret for application app-20180108232000-
INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 -
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 -
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO 2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 -
Application app-20180108232000- removed, cleanupLocalDirs = true
INFO 2018-01-08 23:21:31,703 Logging.scala:54 - Driver