[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue

zhihai xu (JIRA) Sat, 18 Jul 2015 16:03:04 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632630#comment-14632630
 ]


zhihai xu commented on MAPREDUCE-6436:
--------------------------------------

Thanks for working on this issue [~ryu_kobayashi]! It looks like the log will 
only be printed for the HistoryFileInfo at state {{IN_INTERMEDIATE}} or 
{{MOVE_FAILED}}. I think most of HistoryFileInfo should be at state {{IN_DONE}}.
bq. HistoryFileManager.addIfAbsent method produces 50000 - 20000 = 30000 lines 
of "Waiting to remove <key> from JobListCache because it is not in done yet" 
message
The above statement may not be a valid case unless you have a performance issue 
at HDFS which cause {{HistoryFileInfo#moveToDone}} take very long time.
The direct cause for your issue may be a HDFS performance issue. But we can 
improve the logs to print less message.
About your patch, Changing {{scanIfNeeded}} to nonblocking may not be good 
because the following code at {{HistoryFileManager#getFileInfo}} expects 
{{jobListCache}} has the entry for the given job after 
{{scanIntermediateDirectory}} returns, which need block {{scanIfNeeded}}.
{code}
    // OK so scan the intermediate to be sure we did not lose it that way
    scanIntermediateDirectory();
    fileInfo = jobListCache.get(jobId);
    if (fileInfo != null) {
      return fileInfo;
    }
{code}
Also the implementation of {{scanIfNeeded}} will make sure {{ 
scanIntermediateDirectory(p);}} will only be called once.
{code}
    if (modTime != newModTime) {
        Path p = fs.getPath();
        try {
          scanIntermediateDirectory(p);
          //If scanning fails, we will scan again.  We assume the failure is
          // temporary.
          modTime = newModTime;
        } catch (IOException e) {
          LOG.error("Error while trying to scan the directory " + p, e);
        }
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Scan not needed of " + fs.getPath());
        }
      }
{code}
So the performance overhead for {{scanIfNeeded}} won't be that much.

We can make a patch to print less log message. The following logs are printed 
for HistoryFileInfo at both {{IN_INTERMEDIATE}} state and {{MOVE_FAILED}} 
state, Can we add two counters: one for {{IN_INTERMEDIATE}} and the other one 
for {{MOVE_FAILED}}?
Also we can save the first key for HistoryFileInfo at state {{IN_INTERMEDIATE}} 
and the first key for HistoryFileInfo at state {{MOVE_FAILED}}, print these two 
keys in the logs.
{code}
                } else {
                  LOG.warn("Waiting to remove " + key
                      + " from JobListCache because it is not in done yet.");
                }
{code}

> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Ryu Kobayashi
>            Assignee: Ryu Kobayashi
>         Attachments: MAPREDUCE-6436.1.patch, stacktrace1.txt, 
> stacktrace2.txt, stacktrace3.txt
>
>
> Problem: 
> HistoryFileManager.addIfAbsent produces large amount of logs if number of
> cached entries whose age is less than mapreduce.jobhistory.max-age-ms becomes
> larger than mapreduce.jobhistory.joblist.cache.size by far.
> Example:
> For example, if the cache contains 50000 entries in total and 10,000 entries
> newer than mapreduce.jobhistory.max-age-ms where
> mapreduce.jobhistory.joblist.cache.size is 20000, 
> HistoryFileManager.addIfAbsent
> method produces 50000 - 20000 = 30000 lines of "Waiting to remove <key> from
> JobListCache because it is not in done yet" message.
> It will attach a stacktrace.
> Impact:
> In addition to large disk consumption, this issue blocks JobHistory.getJob
> long time and slows job execution down significantly because getJob is called
> by RPC such as HistoryClientService.HSClientProtocolHandler.getJobReport.
> This impact happens because HistoryFileManager.UserLogDir.scanIfNeeded
> eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When
> multiple threads call scanIfNeeded simultaneously, one of them acquires lock
> and the other threads are blocked until the first thread completes 
> long-running
> HistoryFileManager.addIfAbsent call.
> Solution: 
> * Reduce amount of logs so that HistoryFileManager.addIfAbsent doesn't take 
> too long time.
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
>   scanning if another thread is already scanning. This changes semantics of
>   some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
>   because scanIfNeeded keep outdated state.
> * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls 
> are
>   not blocked by a loop at scale of tens of thousands.
>  
> This patch implemented the first item.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MAPREDUCE-6436) JobHistory cache issue

Reply via email to