[ https://issues.apache.org/jira/browse/MAPREDUCE-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318638#comment-14318638 ]
Craig Welch commented on MAPREDUCE-6252:
----------------------------------------

Not at all, will do.

> JobHistoryServer should not fail when encountering a missing directory
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.6.0
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>         Attachments: MAPREDUCE-6252.0.patch
>
>
> The JobHistoryServer maintains a cache mapping job serial number parts to DFS
> paths, which it uses when seeking a job that is no longer in its memory cache.
> There may be multiple directories for a given serial number, differentiated
> by timestamp. At present the JobHistoryServer fails any time it attempts to
> find a job in a cached directory that no longer exists, even though the job
> may well exist in a different directory for the same serial number. Typically
> this is not an issue, but the history cleanup process removes the directory
> from DFS before removing it from the cache, which leaves a window of time in
> which a directory that is present in the cache may be missing from DFS,
> resulting in failure. For some DFS implementations it appears that the
> top-level directory may become unavailable some time before the full deletion
> of the tree completes, which extends what might otherwise be a brief period
> of failure to a longer one. Further, this also places the service at the
> mercy of outside processes which might remove those directories. The proposal
> is simply to make the server resilient to this state, so that encountering a
> missing directory is not fatal and the lookup continues on to seek the job
> elsewhere.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
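
The behaviour proposed in the issue description (skip a cached directory that no longer exists and keep searching the remaining candidates) might look roughly like the sketch below. This is not the attached patch and does not use the Hadoop FileSystem API; it substitutes java.nio.file for DFS access so that it runs standalone, and the class name TolerantHistoryLookup, the method findJobFile, and the "file name starts with the job id" matching rule are all hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the lookup described in the issue; not Hadoop code.
public class TolerantHistoryLookup {

    /**
     * Scans each cached directory for a file whose name starts with jobId.
     * A cached directory that has since been deleted is skipped instead of
     * aborting the whole search.
     */
    public static Optional<Path> findJobFile(List<Path> cachedDirs, String jobId)
            throws IOException {
        for (Path dir : cachedDirs) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(
                    dir, p -> p.getFileName().toString().startsWith(jobId))) {
                for (Path entry : entries) {
                    return Optional.of(entry); // first match wins
                }
            } catch (NoSuchFileException e) {
                // Stale cache entry: the directory was removed after it was
                // cached. Treat it as "not here" and try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

Only the "directory is missing" case is swallowed here; other I/O errors still propagate, on the assumption that they indicate a real problem rather than the cleanup race described above.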