Apache Spark reassigned SPARK-21571:
------------------------------------

    Assignee: Apache Spark

Spark history server leaves incomplete or unreadable history files around forever.
----------------------------------------------------------------------------------

                Key: SPARK-21571
                URL: https://issues.apache.org/jira/browse/SPARK-21571
            Project: Spark
         Issue Type: Bug
         Components: Scheduler
   Affects Versions: 2.2.0
           Reporter: Eric Vandenberg
           Assignee: Apache Spark
           Priority: Minor

We have noticed that history server logs are sometimes never cleaned up. The current history server logic *ONLY* cleans up completed history files, since in general it does not make sense to clean up in-progress history files (after all, the job is presumably still running). Note that in-progress history files would generally not be targeted for cleanup anyway, assuming the job regularly flushes its logs and the file system accurately updates the history log's last modified time and size; while this is likely, it is not guaranteed behavior.

As a consequence of the current cleanup logic, combined with unclean shutdowns, various file system bugs, earlier Spark bugs, etc., we have accumulated thousands of these dead history files associated with long-since-gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw---- 3 xxxxxx ooooooo 14382 2016-09-13 15:40 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq1974_ppppppppppp-8812_110586000000195_dev4384_jjjjjjjjjjjj-53982.zstandard
-rw-rw---- 3 xxxx ooooooo 5933 2016-11-01 20:16 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq2016_ppppppppppp-8812_126507000000673_dev5365_jjjjjjjjjjjj-65313.lz4
-rw-rw---- 3 yyy ooooooo 0 2017-01-19 11:59 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0057_zzzz326_mmmmmmmmm-57863.lz4.inprogress
-rw-rw---- 3 xxxxxxxxx ooooooo 0 2017-01-19 14:17 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0063_zzzz688_mmmmmmmmm-33246.lz4.inprogress
-rw-rw---- 3 yyy ooooooo 0 2017-01-20 10:56 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1030_zzzz326_mmmmmmmmm-45195.lz4.inprogress
-rw-rw---- 3 xxxxxxxxxxxx ooooooo 11955 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1314_wwww54_kkkkkkkkkkkkkk-64671.lz4.inprogress
-rw-rw---- 3 xxxxxxxxxxxx ooooooo 11958 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1315_wwww1667_kkkkkkkkkkkkkk-58968.lz4.inprogress
-rw-rw---- 3 xxxxxxxxxxxx ooooooo 11960 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1316_wwww54_kkkkkkkkkkkkkk-48058.lz4.inprogress

Based on the current logic, cleanup candidates are skipped in several cases:
1. If a file has 0 bytes, it is completely ignored.
2. If a file is in progress and not parseable (no appID can be extracted), it is completely ignored.
3. If a file is complete but not parseable (no appID can be extracted), it is completely ignored.

To address this edge case and provide a way to clean out orphaned history files, I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history files in cases (1), (2) and (3). Since the default is false, existing customers won't be affected unless they explicitly opt in. If customers are leaking similar garbage over time, they have the option of aggressively cleaning it up in such cases.
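As an illustration of how the opt-in might look, the proposed flag could sit next to the existing cleaner settings in spark-defaults.conf. Note the spark.history.fs.cleaner.aggressive key is only the option proposed here, not an existing Spark setting; the other keys are current cleaner options:

    # Existing history server cleaner settings
    spark.history.fs.cleaner.enabled     true
    spark.history.fs.cleaner.interval    1d
    spark.history.fs.cleaner.maxAge      14d

    # Proposed in this issue: also reclaim 0-byte and unparseable history
    # files (cases 1-3 above) once they exceed maxAge
    spark.history.fs.cleaner.aggressive  true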
Also note that aggressive cleanup may not be appropriate for some customers, e.g. if they have long-running jobs that exceed the cleaner.maxAge time frame and/or have buggy file systems.

Would like to get feedback on whether this seems like a reasonable solution.
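To make the proposed behavior concrete, here is a rough Scala sketch of the per-file decision the cleaner could make. This is illustrative only, not the actual FsHistoryProvider code; LogFile, shouldDelete, isParseable and the aggressive parameter are hypothetical names introduced for the example:

    // Sketch only: `aggressive` mirrors the proposed
    // spark.history.fs.cleaner.aggressive flag, `maxAgeMs` mirrors
    // spark.history.fs.cleaner.maxAge, and `isParseable` stands for
    // "an appID could be extracted from the log" (0-byte files are never parseable).
    case class LogFile(path: String, sizeBytes: Long, lastModified: Long,
                       inProgress: Boolean, isParseable: Boolean)

    def shouldDelete(log: LogFile, now: Long, maxAgeMs: Long, aggressive: Boolean): Boolean = {
      val expired = (now - log.lastModified) > maxAgeMs
      if (!expired) {
        false                                  // never touch files newer than maxAge
      } else if (log.isParseable && !log.inProgress) {
        true                                   // current behavior: completed, parseable, expired
      } else {
        // cases (1)-(3): 0-byte, in-progress but unparseable, complete but unparseable
        aggressive                             // reclaimed only when the new flag is opted into
      }
    }

Long-running jobs that legitimately exceed maxAge would fall into the final branch while still in progress, which is why the flag defaults to false and is an explicit opt-in.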