[
https://issues.apache.org/jira/browse/MAPREDUCE-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594162#comment-15594162
]
Karthik Kambatla commented on MAPREDUCE-6797:
---------------------------------------------
If multiple threads call {{addIfAbsent}} simultaneously, is it possible they
process the same {{HistoryFileInfo}}? How do we ensure only one thread is
processing a file?
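To make the question concrete, here is a minimal sketch of one common way to let exactly one caller "win" a given entry. It is not the Hadoop code: {{FileInfoSketch}} and {{AddIfAbsentSketch}} are hypothetical stand-ins, and the sketch assumes the cache is a {{ConcurrentMap}}. It relies on {{ConcurrentMap.putIfAbsent}} being atomic, so it returns null only for the thread whose value was actually inserted.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for HistoryFileInfo; not the Hadoop class.
class FileInfoSketch {
  final String jobId;

  FileInfoSketch(String jobId) {
    this.jobId = jobId;
  }
}

// Hypothetical stand-in for JobListCache, assuming a ConcurrentMap-backed cache.
class AddIfAbsentSketch {
  private final ConcurrentMap<String, FileInfoSketch> cache =
      new ConcurrentHashMap<>();

  // Returns null if the caller's value was inserted (the caller "owns" the
  // entry), otherwise the value that some other thread inserted first.
  FileInfoSketch addIfAbsent(FileInfoSketch info) {
    return cache.putIfAbsent(info.jobId, info);
  }

  // Starts `threads` callers that race to add the same job id and returns
  // how many of them saw null, i.e. how many would go on to process it.
  static int countWinners(int threads) {
    AddIfAbsentSketch sketch = new AddIfAbsentSketch();
    AtomicInteger winners = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        try {
          start.await(); // release all callers at roughly the same time
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (sketch.addIfAbsent(new FileInfoSketch("job_1")) == null) {
          winners.incrementAndGet();
        }
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return winners.get();
  }
}
```

With this pattern, only the thread that gets null back would process the file; every other caller gets the existing object and can skip it. Whether the real {{addIfAbsent}} gives the same guarantee once the synchronized block is removed is exactly what I am asking.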
> Job history server scans can become blocked on a single, slow entry
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-6797
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6797
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.4.0, 2.8.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Critical
> Attachments: jstack
>
>
> There is one more piece of code in HistoryFileManager where the synchronized
> block on HistoryFileInfo needs to be removed. We hit the JobHistoryServer
> contention issue in our environment: the attached stack trace shows
> HistoryFileManager$JobListCache.addIfAbsent waiting unnecessarily to lock
> a HistoryFileInfo.
> The synchronized keyword on isMovePending and didMoveFail was already removed
> by MAPREDUCE-6684.
> {code}
> HistoryFileInfo firstValue = cache.get(key);
> // The synchronized block below is not needed.
> synchronized (firstValue) {
>   if (firstValue.isMovePending()) {
>     if (firstValue.didMoveFail() &&
>         firstValue.jobIndexInfo.getFinishTime() <= cutoff) {
>       cache.remove(key);
>       // Now lets try to delete it
>       try {
>         firstValue.delete();
>       } catch (IOException e) {
>         LOG.error("Error while trying to delete history files" +
>             " that could not be moved to done.", e);
>       }
>     } else {
>       LOG.warn("Waiting to remove " + key
>           + " from JobListCache because it is not in done yet.");
>     }
>   } else {
>     cache.remove(key);
>   }
> }
> {code}
> Note: the stack trace is from hadoop-2.4.0; the problem exists in the latest
> Hadoop as well.
> {code}
> "2144820863@qtp-313351300-38156" daemon prio=10 tid=0x0000000001e13800 nid=0xf133 waiting for monitor entry [0x00007f7c1d8dd000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$JobListCache.addIfAbsent(HistoryFileManager.java:226)
>     - waiting to lock <0x000000040145c4d8> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:825)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280)
>     - locked <0x0000000400375388> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792)
>     at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getAllFileInfo(HistoryFileManager.java:920)
>     at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getAllPartialJobs(CachedHistoryStorage.java:156)
>     at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getAllJobs(JobHistory.java:235)
> {code}