GitHub user RussellSpitzer opened a pull request:

    https://github.com/apache/spark/pull/20298

    [SPARK-22976][Core]: Cluster mode driver dir removed while running

    ## What changes were proposed in this pull request?
    
    The cleanup logic on the worker previously determined the liveness of a
    particular application based solely on whether it had running executors.
    This failed when a directory had been created for a driver running in
    cluster mode but that driver had no executors running on the same
    machine: the driver's directory was removed while the driver was still
    running. To preserve driver directories, we now consider both running
    executors and running drivers when checking directory liveness.
    
    ## How was this patch tested?
    
    Manually started a two-node cluster with a single core on each node,
    turned on worker directory cleanup, and set both the cleanup interval and
    the liveness (TTL) threshold to one second. Without the patch the driver
    directory is removed almost immediately after the app is launched; with
    the patch it is removed only after the driver finishes.
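    
    For reference, the worker cleanup settings used for this manual test would
    look roughly as follows (these are the standard standalone-worker cleanup
    properties; the one-second values are assumptions matching the description
    above):
    
    ```
    spark.worker.cleanup.enabled      true
    # check for stale app/driver directories every second (default is 30 minutes)
    spark.worker.cleanup.interval     1
    # treat data older than one second as eligible for removal (default is 7 days)
    spark.worker.cleanup.appDataTtl   1
    ```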
    
    
    ### Without Patch
    ```
    INFO  2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver 
driver-20180105234824-0000
    INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to: 
cassandra
    INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to: 
cassandra
    INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups 
to:
    INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls 
groups to:
    INFO  2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(cassandra); groups with view permissions: Set(); users  with modify 
permissions: Set(cassandra); groups with modify permissions: Set()
    INFO  2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-0000/writeRead-0.1.jar
    INFO  2018-01-05 23:48:25,332 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-0000/writeRead-0.1.jar
    INFO  2018-01-05 23:48:25,361 Logging.scala:54 - Launch Command: 
"/usr/lib/jvm/jdk1.8.0_40//bin/java" ....
    ****
    INFO  2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory: 
/var/lib/spark/worker/driver-20180105234824-0000  ### << Cleaned up
    ****
    -- 
    One minute passes while app runs (app has 1 minute sleep built in)
    --
    
    WARN  2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to 
unregister application app-20180105234831-0000 when it is not registered
    INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831-0000 removed, cleanupLocalDirs = false
    INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831-0000 removed, cleanupLocalDirs = false
    INFO  2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831-0000 removed, cleanupLocalDirs = true
    INFO  2018-01-05 23:50:00,999 Logging.scala:54 - Driver 
driver-20180105234824-0000 exited successfully
    ```
    
    ### With Patch
    ```
    INFO  2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver 
driver-20180108231954-0002
    INFO  2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to: 
automaton
    INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to: 
automaton
    INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups 
to:
    INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls 
groups to:
    INFO  2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(automaton); groups with view permissions: Set(); users  with modify 
permissions: Set(automaton); groups with modify permissions: Set()
    INFO  2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
    INFO  2018-01-08 23:19:55,031 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
    INFO  2018-01-08 23:19:55,038 Logging.scala:54 - Launch Command: ......
    INFO  2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered 
shuffle secret for application app-20180108232000-0000
    INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000-0000 removed, cleanupLocalDirs = false
    INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000-0000 removed, cleanupLocalDirs = false
    INFO  2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000-0000 removed, cleanupLocalDirs = true
    INFO  2018-01-08 23:21:31,703 Logging.scala:54 - Driver 
driver-20180108231954-0002 exited successfully
    *****
    INFO  2018-01-08 23:21:32,346 Logging.scala:54 - Removing directory: 
/var/lib/spark/worker/driver-20180108231954-0002  ### << Happening AFTER the run 
completes rather than during it
    *****
    ```
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/RussellSpitzer/spark SPARK-22976-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20298.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20298
    
----
commit 38916f769252938fbce891cf1d21972e50a01181
Author: Russell Spitzer <russell.spitzer@...>
Date:   2018-01-17T20:13:57Z

    [SPARK-22976][Core]: Cluster mode driver dir removed while running
    
    The cleanup logic on the worker previously determined the liveness of a
    particular application based on whether or not it had running executors.
    This would fail in the case that a directory was made for a driver
    running in cluster mode if that driver had no running executors on the
    same machine. To preserve driver directories we consider both executors
    and running drivers when checking directory liveness.

----


---
