[ https://issues.apache.org/jira/browse/MAPREDUCE-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514007#comment-14514007 ]
Hudson commented on MAPREDUCE-6252:
-----------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/167/])
MAPREDUCE-6252. JobHistoryServer should not fail when encountering a (devaraj: rev 5e67c4d384193b38a85655c8f93193596821faa5)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestHistoryFileManager.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
* hadoop-mapreduce-project/CHANGES.txt


> JobHistoryServer should not fail when encountering a missing directory
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.6.0
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>             Fix For: 2.8.0
>
>         Attachments: MAPREDUCE-6252.0.patch, MAPREDUCE-6252.1.patch
>
>
> The JobHistoryServer maintains a cache mapping job serial number parts to
> dfs paths, which it uses when seeking a job it no longer has in its memory
> cache; there may be multiple directories for a given serial number,
> differentiated by timestamp. At present the JobHistoryServer will fail any
> time it attempts to find a job in a directory which no longer exists based
> on that cache, even though the job may well exist in a different directory
> for the serial number. Typically this is not an issue, but the history
> cleanup process removes the directory from dfs before removing it from the
> cache, which leaves a window of time where a directory may be missing from
> dfs while still present in the cache, resulting in failure. For some DFS
> implementations it appears that the top-level directory may become
> unavailable some time before the full deletion of the tree completes, which
> extends what might otherwise be a brief period of failure to a longer one.
> Further, this also places the service at the mercy of outside processes
> which might remove those directories. The proposal is simply to make the
> server resilient to this state, such that encountering a missing directory
> is not fatal and the process continues on to seek the job elsewhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
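
For illustration, the behaviour proposed in the issue description can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the code from the attached patches: it uses java.nio.file on a local file system rather than Hadoop's file system API, and the class MissingDirLookup, the method findJobFile, and the parameter names are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Illustrative only; not the HistoryFileManager code changed by the patch. */
final class MissingDirLookup {

    /**
     * Searches each cached directory, in order, for a file named jobFileName.
     * A directory that has vanished since it was cached is skipped instead of
     * being treated as a fatal error, so later directories are still checked.
     */
    static Optional<Path> findJobFile(List<Path> cachedDirs, String jobFileName)
            throws IOException {
        for (Path dir : cachedDirs) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (entry.getFileName().toString().equals(jobFileName)) {
                        return Optional.of(entry);
                    }
                }
            } catch (NoSuchFileException e) {
                // The directory is still in the cache but no longer on disk
                // (for example, cleanup removed it first): note it and move on.
                System.err.println("Skipping missing directory: " + dir);
            }
        }
        // Not found in any directory that still exists.
        return Optional.empty();
    }
}
```

Only the "directory does not exist" case is swallowed here; any other IOException still propagates to the caller, so genuine I/O problems are not hidden.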