GitHub user ericvandenbergfb opened a pull request:

    https://github.com/apache/spark/pull/18791

    [SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
    history files around forever.
    
    Fix logic:
    1. checkForLogs excluded 0-byte files, so they stuck around forever.
    2. checkForLogs / mergeApplicationListing indefinitely ignored files
    that were not parseable or from which an app ID could not be extracted,
    so they stuck around forever.
    
    Only apply the above logic if spark.history.fs.cleaner.aggressive=true
    (see the sketch below).
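
    A rough sketch of the untracked-file selection this implies, in Scala
    (illustrative only, not the actual patch; recognizedAppIds stands in for
    the set of app IDs the provider parsed successfully):

        import org.apache.hadoop.fs.{FileStatus, FileSystem}

        object UntrackedLogCleaner {
          // A history file is "untracked" if it is 0 bytes or if its name
          // does not map to an app ID the provider managed to parse.
          def expiredUntracked(
              statuses: Seq[FileStatus],
              recognizedAppIds: Set[String],
              now: Long,
              maxAgeMs: Long): Seq[FileStatus] = {
            statuses.filter { s =>
              val untracked =
                s.getLen == 0 || !recognizedAppIds.contains(s.getPath.getName)
              untracked && (now - s.getModificationTime) > maxAgeMs
            }
          }

          // Delete the expired untracked files from the event log directory.
          def clean(fs: FileSystem, toDelete: Seq[FileStatus]): Unit =
            toDelete.foreach(s => fs.delete(s.getPath, false))
        }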
    
    Fixed a race condition in a test ("SPARK-3697: ignore files that cannot be
    read.") where the number of mergeApplicationListing calls could be more
    than 1, since FsHistoryProvider would spin up an executor that also calls
    checkForLogs in parallel with the test unless spark.testing=true is
    configured (see the sketch below).
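
    For reference, a minimal sketch of how a test opts out of that background
    executor, assuming the spark.testing gating described above:

        import org.apache.spark.SparkConf

        // spark.testing=true keeps FsHistoryProvider from starting its
        // background checkForLogs executor during the test.
        val conf = new SparkConf().set("spark.testing", "true")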
    
    Added unit test to cover all cases with aggressive and non-aggressive
    clean up logic.
    
    ## What changes were proposed in this pull request?
    
    The Spark history server doesn't clean up certain history files outside
    the retention window, leading to thousands of such files lingering on our
    servers. The log checking and cleanup logic skipped 0-byte files and
    expired in-progress or complete history files that weren't properly
    parseable (no app ID could be extracted, or the file otherwise failed to
    parse). Note these files most likely appeared due to aborted jobs or
    earlier Spark/file system driver bugs. To mitigate this,
    FsHistoryProvider.checkForLogs now internally identifies these untracked
    files and removes them once they expire outside the cleaner retention
    window.
    
    This is currently controlled via the configuration
    spark.history.fs.cleaner.aggressive=true, which enables the more
    aggressive cleaning.
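
    A minimal sketch of reading the proposed flag from SparkConf (the flag
    name comes from this PR; the default of false is an assumption):

        import org.apache.spark.SparkConf

        val conf = new SparkConf().set("spark.history.fs.cleaner.aggressive", "true")
        // Assumed to default to false so existing deployments keep the
        // current, non-aggressive cleanup behavior.
        val aggressive =
          conf.getBoolean("spark.history.fs.cleaner.aggressive", defaultValue = false)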
    
    ## How was this patch tested?
    
    Implemented a unit test that exercises the above cases both with and
    without aggressive cleaning, to ensure correct results in all cases. Note
    that FsHistoryProvider in one place uses the file system to get the
    current time and elsewhere uses the local system time; this seems
    inconsistent/buggy, but I did not attempt to fix it in this commit. I had
    to change the method FsHistoryProvider.getNewLastScanTime() so the test
    could properly mock the clock.
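
    A sketch of the override this enables, assuming a test in the same
    package as FsHistoryProvider and Spark's ManualClock helper (the actual
    test wiring in the patch may differ):

        import org.apache.spark.SparkConf
        import org.apache.spark.util.ManualClock

        // Pin the provider's scan time to a manual clock so expiration
        // thresholds in the test are deterministic. Assumes the patch makes
        // getNewLastScanTime() overridable.
        val clock = new ManualClock(1000L)
        val provider = new FsHistoryProvider(new SparkConf(), clock) {
          override def getNewLastScanTime(): Long = clock.getTimeMillis()
        }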
    
    Also ran a history server and touched some files to verify they were 
properly removed.
    
    ericvandenberg@localhost /tmp/spark-events % ls -la
    total 808K
    drwxr-xr-x   8 ericvandenberg  272 Jul 31 18:22 .
    drwxrwxrwt 127 root           4.3K Jul 31 18:07 ..
    -rw-r--r--   1 ericvandenberg    0 Jan  1  2016 local-123.inprogress
    -rwxr-x---   1 ericvandenberg 342K Jan  1  2016 local-1501549952084
    -rwxrwx---   1 ericvandenberg 342K Jan  1  2016 local-1501549952084.inprogress
    -rwxrwx---   1 ericvandenberg  59K Jul 31 18:19 local-1501550073208
    -rwxrwx---   1 ericvandenberg  59K Jul 31 18:21 local-1501550473508.inprogress
    -rw-r--r--   1 ericvandenberg    0 Jan  1  2016 local-234
    
    Observed in history server logs:
    
    17/07/31 18:23:52 INFO FsHistoryProvider: Aggressively cleaned up 4 untracked history files.
    
    ericvandenberg@localhost /tmp/spark-events % ls -la 
    total 120K
    drwxr-xr-x   4 ericvandenberg  136 Jul 31 18:24 .
    drwxrwxrwt 127 root           4.3K Jul 31 18:07 ..
    -rwxrwx---   1 ericvandenberg  59K Jul 31 18:19 local-1501550073208
    -rwxrwx---   1 ericvandenberg  59K Jul 31 18:22 local-1501550473508


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark cleanup.untracked.history.files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18791.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18791
    
----
commit c52b1cfd2eee9c881267d3d4cd9ea83fb6a767eb
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date:   2017-07-31T22:02:54Z

    [SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
    history files around forever.
    
    Fix logic:
    1. checkForLogs excluded 0-byte files, so they stuck around forever.
    2. checkForLogs / mergeApplicationListing indefinitely ignored files
    that were not parseable or from which an app ID could not be extracted,
    so they stuck around forever.
    
    Only apply the above logic if spark.history.fs.cleaner.aggressive=true.
    
    Fixed a race condition in a test ("SPARK-3697: ignore files that cannot be
    read.") where the number of mergeApplicationListing calls could be more
    than 1, since FsHistoryProvider would spin up an executor that also calls
    checkForLogs in parallel with the test.
    
    Added unit test to cover all cases with aggressive and non-aggressive
    clean up logic.

----

