[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632630#comment-14632630 ]
zhihai xu commented on MAPREDUCE-6436: -------------------------------------- Thanks for working on this issue [~ryu_kobayashi]! It looks like the log will only be printed for the HistoryFileInfo at state {{IN_INTERMEDIATE}} or {{MOVE_FAILED}}. I think most of HistoryFileInfo should be at state {{IN_DONE}}. bq. HistoryFileManager.addIfAbsent method produces 50000 - 20000 = 30000 lines of "Waiting to remove <key> from JobListCache because it is not in done yet" message The above statement may not be a valid case unless you have a performance issue at HDFS which cause {{HistoryFileInfo#moveToDone}} take very long time. The direct cause for your issue may be a HDFS performance issue. But we can improve the logs to print less message. About your patch, Changing {{scanIfNeeded}} to nonblocking may not be good because the following code at {{HistoryFileManager#getFileInfo}} expects {{jobListCache}} has the entry for the given job after {{scanIntermediateDirectory}} returns, which need block {{scanIfNeeded}}. {code} // OK so scan the intermediate to be sure we did not lose it that way scanIntermediateDirectory(); fileInfo = jobListCache.get(jobId); if (fileInfo != null) { return fileInfo; } {code} Also the implementation of {{scanIfNeeded}} will make sure {{ scanIntermediateDirectory(p);}} will only be called once. {code} if (modTime != newModTime) { Path p = fs.getPath(); try { scanIntermediateDirectory(p); //If scanning fails, we will scan again. We assume the failure is // temporary. modTime = newModTime; } catch (IOException e) { LOG.error("Error while trying to scan the directory " + p, e); } } else { if (LOG.isDebugEnabled()) { LOG.debug("Scan not needed of " + fs.getPath()); } } {code} So the performance overhead for {{scanIfNeeded}} won't be that much. We can make a patch to print less log message. The following logs are printed for HistoryFileInfo at both {{IN_INTERMEDIATE}} state and {{MOVE_FAILED}} state, Can we add two counters: one for {{IN_INTERMEDIATE}} and the other one for {{MOVE_FAILED}}? Also we can save the first key for HistoryFileInfo at state {{IN_INTERMEDIATE}} and the first key for HistoryFileInfo at state {{MOVE_FAILED}}, print these two keys in the logs. {code} } else { LOG.warn("Waiting to remove " + key + " from JobListCache because it is not in done yet."); } {code} > JobHistory cache issue > ---------------------- > > Key: MAPREDUCE-6436 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436 > Project: Hadoop Map/Reduce > Issue Type: Bug > Reporter: Ryu Kobayashi > Assignee: Ryu Kobayashi > Attachments: MAPREDUCE-6436.1.patch, stacktrace1.txt, > stacktrace2.txt, stacktrace3.txt > > > Problem: > HistoryFileManager.addIfAbsent produces large amount of logs if number of > cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes > larger than mapreduce.jobhistory.joblist.cache.size by far. > Example: > For example, if the cache contains 50000 entries in total and 10,000 entries > newer than mapreduce.jobhistory.max-age-ms where > mapreduce.jobhistory.joblist.cache.size is 20000, > HistoryFileManager.addIfAbsent > method produces 50000 - 20000 = 30000 lines of "Waiting to remove <key> from > JobListCache because it is not in done yet" message. > It will attach a stacktrace. > Impact: > In addition to large disk consumption, this issue blocks JobHistory.getJob > long time and slows job execution down significantly because getJob is called > by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport. > This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded > eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When > multiple threads call scanIfNeeded simultaneously, one of them acquires lock > and the other threads are blocked until the first thread completes > long-running > HistoryFileManager.addIfAbsent call. > Solution: > * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take > too long time. > * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips > scanning if another thread is already scanning. This changes semantics of > some HistoryFileManager methods (such as getAllFileInfo and getFileInfo) > because scanIfNeeded keep outdated state. > * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls > are > not blocked by a loop at scale of tens of thousands. > > This patch implemented the first item. -- This message was sent by Atlassian JIRA (v6.3.4#6332)