[ https://issues.apache.org/jira/browse/MAPREDUCE-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594162#comment-15594162 ]
Karthik Kambatla commented on MAPREDUCE-6797:
---------------------------------------------

If multiple threads call {{addIfAbsent}} simultaneously, is it possible that they process the same {{HistoryFileInfo}}? How do we ensure only one thread is processing a file?

> Job history server scans can become blocked on a single, slow entry
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6797
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6797
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.4.0, 2.8.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Critical
>         Attachments: jstack
>
>
> There is one more piece of code in HistoryFileManager where the
> synchronized keyword on HistoryFileInfo needs to be removed. The
> JobHistoryServer contention issue was hit in our environment, where the
> attached stack trace shows HistoryFileManager$JobListCache.addIfAbsent
> unnecessarily waiting to lock on HistoryFileInfo.
> The synchronized keyword on isMovePending and didMoveFail was already
> removed by MAPREDUCE-6684.
> {code}
>       HistoryFileInfo firstValue = cache.get(key);
>       synchronized(firstValue) { // <-- synchronized is not needed here
>         if (firstValue.isMovePending()) {
>           if(firstValue.didMoveFail() &&
>               firstValue.jobIndexInfo.getFinishTime() <= cutoff) {
>             cache.remove(key);
>             //Now lets try to delete it
>             try {
>               firstValue.delete();
>             } catch (IOException e) {
>               LOG.error("Error while trying to delete history files" +
>                   " that could not be moved to done.", e);
>             }
>           } else {
>             LOG.warn("Waiting to remove " + key
>                 + " from JobListCache because it is not in done yet.");
>           }
>         } else {
>           cache.remove(key);
>         }
>       }
> {code}
> {code}
> Note: stacktrace is from hadoop-2.4.0 version and the problem exists in latest hadoop as well
> "2144820863@qtp-313351300-38156" daemon prio=10 tid=0x0000000001e13800 nid=0xf133 waiting for monitor entry [0x00007f7c1d8dd000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$JobListCache.addIfAbsent(HistoryFileManager.java:226)
>         - waiting to lock <0x000000040145c4d8> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:825)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:280)
>         - locked <0x0000000400375388> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:792)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getAllFileInfo(HistoryFileManager.java:920)
>         at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getAllPartialJobs(CachedHistoryStorage.java:156)
>         at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getAllJobs(JobHistory.java:235)
> {code}
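
On the question of how only one thread ends up processing a given entry once the {{synchronized}} block is gone: one possible approach, sketched here purely for illustration, is a non-blocking per-entry claim flag. The names {{ClaimDemo}}, {{FileEntry}}, {{tryClaim}} and {{countWinners}} are hypothetical and are not part of HistoryFileManager; this is a minimal sketch of the idea under those assumptions, not the change proposed on this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class ClaimDemo {

  // Hypothetical stand-in for HistoryFileInfo; NOT the Hadoop class.
  static final class FileEntry {
    private final AtomicBoolean claimed = new AtomicBoolean(false);

    // Returns true for exactly one caller. Everyone else gets false
    // immediately instead of blocking on a monitor.
    boolean tryClaim() {
      return claimed.compareAndSet(false, true);
    }
  }

  // Starts `threads` workers that all race for the same cache entry and
  // returns how many of them won the claim.
  static int countWinners(int threads) {
    ConcurrentMap<String, FileEntry> cache = new ConcurrentHashMap<>();
    cache.put("job_1", new FileEntry()); // hypothetical key
    AtomicInteger winners = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // release all workers at once
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        FileEntry entry = cache.get("job_1");
        if (entry != null && entry.tryClaim()) {
          // Only the claiming thread would process or delete the file here.
          winners.incrementAndGet();
        }
      });
    }

    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return winners.get();
  }

  public static void main(String[] args) {
    System.out.println("threads that processed the entry: " + countWinners(8));
  }
}
```

Because {{compareAndSet(false, true)}} can succeed only once per entry, the losing threads skip the entry instead of queueing behind a slow one. Whether a skipped entry must be retried later is a separate design question that this sketch does not address.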