[ 
https://issues.apache.org/jira/browse/SPARK-15263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15263:
--------------------------------
    Description: 
The current logic for directory cleanup (JavaUtils.deleteRecursively) is slow 
because it lists the directory, recurses over child directories, checks for 
symbolic links, deletes the leaf files, and finally deletes each directory once 
it is empty. This causes repeated back-and-forth switching between kernel space 
and user space. Since most deployment backends are Unix systems, we could 
simply shell out to `rm -rf` so that the entire deletion runs in kernel space.
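A minimal sketch of the idea, assuming we shell out via ProcessBuilder on Unix-like platforms and fall back to a plain JVM recursive delete elsewhere (class and method names here are illustrative, not Spark's actual implementation):

```java
import java.io.File;
import java.io.IOException;

public class NativeDelete {
    // Sketch only: on Unix-like systems, invoke `rm -rf` so the whole
    // deletion happens in kernel space; otherwise fall back to the
    // JVM-based recursive delete.
    public static void deleteRecursively(File dir)
            throws IOException, InterruptedException {
        if (isUnix()) {
            Process p = new ProcessBuilder("rm", "-rf", dir.getAbsolutePath())
                    .inheritIO()
                    .start();
            int rc = p.waitFor();
            if (rc != 0) {
                throw new IOException("rm -rf exited with code " + rc);
            }
        } else {
            deleteRecursivelyInJvm(dir);
        }
    }

    private static boolean isUnix() {
        String os = System.getProperty("os.name").toLowerCase();
        return os.contains("nix") || os.contains("nux") || os.contains("mac");
    }

    // JVM fallback: list children, recurse into them, then delete the
    // directory itself once it is empty.
    private static void deleteRecursivelyInJvm(File f) throws IOException {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursivelyInJvm(c);
            }
        }
        if (f.exists() && !f.delete()) {
            throw new IOException("Failed to delete " + f);
        }
    }
}
```

A real implementation would also need to decide how to handle symbolic links and non-zero exit codes from `rm`, which this sketch glosses over.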

The current Java-based implementation in Spark is similar to what standard 
libraries like Guava and Commons IO do (e.g. 
http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540).
 However, Guava removed this method in favour of shelling out to an operating 
system command (which is exactly what I am proposing). See the Deprecated note 
in the older Guava javadocs for details: 
http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File)

Ideally, Java would provide such APIs so that users don't have to resort to 
platform-specific code. Also, it's not just about speed: handling race 
conditions during filesystem deletions is tricky. I found this Java bug in a 
similar context: 
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952

> Make shuffle service dir cleanup faster by using `rm -rf`
> ---------------------------------------------------------
>
>                 Key: SPARK-15263
>                 URL: https://issues.apache.org/jira/browse/SPARK-15263
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Tejas Patil
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
