[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057241#comment-15057241
 ] 

Kai Sasaki commented on MAPREDUCE-6436:
---------------------------------------

[~djp] Yes, as described above, the slowness of scanIfNeeded slows down 
HistoryClientService.HSClientProtocolHandler.getJobReport, which is called 
from the job client. In some cases this causes a performance issue for the job. 
Usually, however, the job is returned from the JobListCache held by 
HistoryFileManager, and in that case scanIntermediateDirectory is not 
required. So we cannot say that the performance issue occurs immediately 
whenever there are a lot of failed and pending job logs in the intermediate 
directory.
I'm not sure whether we should mark this JIRA as a blocker, though...
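
The cache-hit path described above can be sketched roughly as follows. This is
only an illustration under assumptions: JobLookupSketch, lookup,
scanIntermediate, and scanCount are hypothetical names, and plain maps stand in
for the JobListCache and the intermediate directory. It is not the actual
HistoryFileManager code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the real HistoryFileManager.
class JobLookupSketch {
    // Stand-in for the JobListCache (job id -> file info).
    private final Map<String, String> jobListCache = new ConcurrentHashMap<>();
    // Stand-in for job files sitting in the intermediate directory.
    private final Map<String, String> intermediateDir = new ConcurrentHashMap<>();
    // Counts how many times the expensive scan actually ran.
    final AtomicInteger scanCount = new AtomicInteger();

    void putInIntermediateDir(String jobId, String info) {
        intermediateDir.put(jobId, info);
    }

    // Stand-in for the expensive scan; synchronized, so concurrent
    // callers wait for each other here.
    private synchronized void scanIntermediate() {
        scanCount.incrementAndGet();
        jobListCache.putAll(intermediateDir);
    }

    // Look in the cache first and scan only on a miss.
    String lookup(String jobId) {
        String cached = jobListCache.get(jobId);
        if (cached != null) {
            return cached; // cache hit: no scan, no lock contention
        }
        scanIntermediate();
        return jobListCache.get(jobId); // may still be null if the job is unknown
    }
}
```

With this shape, repeated lookups of a job that is already cached never reach
the synchronized scan, so the slowdown would only show up on cache misses.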

> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Ryu Kobayashi
>            Assignee: Kai Sasaki
>         Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, 
> MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt, 
> stacktrace2.txt, stacktrace3.txt
>
>
> Problem: 
> HistoryFileManager.addIfAbsent produces a large amount of logs if the number
> of cached entries whose age is less than mapreduce.jobhistory.max-age-ms
> becomes much larger than mapreduce.jobhistory.joblist.cache.size.
> Example:
> If the cache contains 50000 entries in total and 10000 entries
> newer than mapreduce.jobhistory.max-age-ms, where
> mapreduce.jobhistory.joblist.cache.size is 20000, the
> HistoryFileManager.addIfAbsent method produces 50000 - 20000 = 30000 lines of
> "Waiting to remove <key> from JobListCache because it is not in done yet"
> messages.
> I will attach a stacktrace.
> Impact:
> In addition to large disk consumption, this issue blocks JobHistory.getJob for
> a long time and slows job execution down significantly, because getJob is
> called by RPCs such as
> HistoryClientService.HSClientProtocolHandler.getJobReport.
> This happens because HistoryFileManager.UserLogDir.scanIfNeeded
> eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When
> multiple threads call scanIfNeeded simultaneously, one of them acquires the
> lock and the other threads are blocked until the first thread completes its
> long-running HistoryFileManager.addIfAbsent call.
> Solution: 
> * Reduce the amount of logs so that HistoryFileManager.addIfAbsent doesn't
>   take too long.
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
>   scanning if another thread is already scanning. This changes the semantics
>   of some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
>   because scanIfNeeded may keep outdated state.
> * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls
>   are not blocked by a loop on the scale of tens of thousands of iterations.
>  
> This patch implemented the first item.
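
As a rough illustration of the first item and of the arithmetic in the example
above, the sketch below contrasts one log line per stuck entry with a single
summary line. Everything here is an assumption made for illustration:
EvictionLogSketch, excess, verboseLog, and summaryLog are hypothetical names,
the summary wording is invented, and this is not claimed to be what the
attached patches actually do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code from the attached patches.
class EvictionLogSketch {
    // Number of entries over the configured cache size.
    static int excess(int cachedEntries, int maxCacheSize) {
        return Math.max(0, cachedEntries - maxCacheSize);
    }

    // Behaviour described in the issue: one message per entry that
    // cannot be removed yet.
    static List<String> verboseLog(List<String> stuckKeys) {
        List<String> lines = new ArrayList<>();
        for (String key : stuckKeys) {
            lines.add("Waiting to remove " + key
                + " from JobListCache because it is not in done yet");
        }
        return lines;
    }

    // Reduced logging (assumed wording): a single summary line no matter
    // how many entries are stuck.
    static List<String> summaryLog(List<String> stuckKeys) {
        List<String> lines = new ArrayList<>();
        if (!stuckKeys.isEmpty()) {
            lines.add("Waiting to remove " + stuckKeys.size()
                + " entries from JobListCache because they are not in done yet"
                + " (first: " + stuckKeys.get(0) + ")");
        }
        return lines;
    }
}
```

With the numbers from the example, excess(50000, 20000) is 30000, so the
per-entry variant would build 30000 lines inside the synchronized section,
while the summary variant builds one.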



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
