GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/13042

    [SPARK-15263][Core] Make shuffle service dir cleanup faster by using `rm 
-rf`

    ## What changes were proposed in this pull request?
    
    Jira: https://issues.apache.org/jira/browse/SPARK-15263
    
    The current directory-cleanup logic is slow because it lists each directory, recurses over child directories, checks for symbolic links, deletes the leaf files, and finally deletes the directories once they are empty. There is back-and-forth switching between kernel space and user space while doing this. Since most deployment backends are Unix systems, we could essentially just do `rm -rf` so that the entire deletion logic runs in kernel space.
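    To illustrate the idea, here is a minimal sketch (not the actual Spark code; class and method names are hypothetical) of shelling out to `rm -rf` from the JVM with `ProcessBuilder`, failing loudly when the command exits non-zero:

    ```java
    import java.io.File;
    import java.io.IOException;

    // Hypothetical sketch of delegating recursive deletion to `rm -rf`.
    public class NativeDelete {
        public static void deleteRecursivelyUsingRm(File dir) throws IOException {
            // Spawn `rm -rf <dir>`; the traversal happens in the external process.
            ProcessBuilder pb = new ProcessBuilder("rm", "-rf", dir.getAbsolutePath());
            pb.inheritIO();
            Process p = pb.start();
            try {
                int exit = p.waitFor();
                if (exit != 0) {
                    throw new IOException("rm -rf exited with code " + exit + " for " + dir);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while deleting " + dir, e);
            }
        }

        public static void main(String[] args) throws IOException {
            File root = new File("/tmp/native-delete-demo");
            new File(root, "a/b").mkdirs();         // build a small nested tree
            deleteRecursivelyUsingRm(root);
            System.out.println(root.exists());      // false after successful deletion
        }
    }
    ```

    A real implementation would of course fall back to the pure-Java path on non-Unix platforms or when `rm` is unavailable.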
    
    The current Java-based implementation in Spark is similar to what standard libraries like Guava and Commons IO do (e.g. 
http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540). However, Guava removed this method in favour of shelling out to an operating system command (as in this PR). See the `Deprecated` note in the older Guava javadocs for details: 
http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File)
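    For contrast, the pure-Java pattern being replaced looks roughly like the following sketch (simplified, not the exact Spark or Commons IO code): list the directory, recurse, then delete leaves and finally the empty directories, with a syscall round-trip per entry:

    ```java
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    // Simplified sketch of the user-space recursive delete pattern.
    public class JavaDelete {
        public static void deleteRecursively(File file) throws IOException {
            // Skip symlinked directories so we never follow links out of the tree.
            if (file.isDirectory() && !Files.isSymbolicLink(file.toPath())) {
                File[] children = file.listFiles();   // directory listing: one syscall per dir
                if (children != null) {
                    for (File child : children) {
                        deleteRecursively(child);     // recurse into every entry
                    }
                }
            }
            // Delete the leaf file, or the directory once it is empty.
            if (!file.delete() && file.exists()) {
                throw new IOException("Failed to delete " + file);
            }
        }

        public static void main(String[] args) throws IOException {
            File root = new File("/tmp/java-delete-demo");
            new File(root, "a/b").mkdirs();
            deleteRecursively(root);
            System.out.println(root.exists());        // false once the tree is gone
        }
    }
    ```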
    
    Ideally, Java itself would provide such APIs so that users would not have to write platform-specific code. Also, it is not just about speed: correctly handling race conditions during filesystem deletions is tricky. I found this Java bug in a similar context: 
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952
    
    ## How was this patch tested?
    
    I am relying on the existing test cases to cover this method. I welcome suggestions on how to test it further.
    
    ## Performance gains
    
    *Input setup*: Created a nested directory structure of depth 3, with each directory containing 50 sub-directories. The input being cleaned up had ~125k directories in total.
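    Such an input tree could be recreated with a sketch like this (hypothetical helper, not part of the PR). With depth 3 and fan-out 50 it yields 50 + 50² + 50³ = 127,550 directories, i.e. the ~125k figure above; the demo below uses fan-out 5 so it runs quickly:

    ```java
    import java.io.File;

    // Hypothetical generator for the benchmark's nested directory structure.
    public class MakeTree {
        static int created = 0;

        static void makeTree(File dir, int depth, int fanout) {
            if (depth == 0) return;
            for (int i = 0; i < fanout; i++) {
                File child = new File(dir, "d" + i);
                child.mkdirs();                  // create this sub-directory
                created++;
                makeTree(child, depth - 1, fanout);
            }
        }

        public static void main(String[] args) {
            File root = new File("/tmp/bench-tree-demo");
            makeTree(root, 3, 5);                // depth 3, fan-out 5 for the demo
            System.out.println(created);         // 5 + 25 + 125 = 155
        }
    }
    ```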
    
    Ran both approaches (in isolation) 6 times each to get average numbers:
    
    Native Java cleanup  | `rm -rf` as a separate process
    ------------ | -------------
    10.04 sec | 4.11 sec
    
    This change made deletion 2.4 times faster for the given test input.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark delete_recursive

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13042.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13042
    
----
commit 32cc1e63fde168e71a6d392106f551e874889a22
Author: Tejas Patil <tej...@fb.com>
Date:   2016-05-11T01:38:21Z

    [SPARK-15263][Core] Make shuffle service dir cleanup faster by using `rm 
-rf`

----

