[ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhihai xu updated MAPREDUCE-6436:
---------------------------------
    Fix Version/s: 2.6.4
                   2.7.3
                   2.8.0

> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ryu Kobayashi
>            Assignee: Kai Sasaki
>            Priority: Blocker
>             Fix For: 2.8.0, 2.7.3, 2.6.4
>
>         Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch,
> MAPREDUCE-6436.3.patch, MAPREDUCE-6436.4.patch, stacktrace1.txt,
> stacktrace2.txt, stacktrace3.txt
>
>
> Problem:
> HistoryFileManager.addIfAbsent produces a large amount of logs if the
> number of cached entries whose age is less than
> mapreduce.jobhistory.max-age-ms becomes far larger than
> mapreduce.jobhistory.joblist.cache.size.
> Example:
> For example, if the cache contains 50000 entries in total and 10000 entries
> newer than mapreduce.jobhistory.max-age-ms where
> mapreduce.jobhistory.joblist.cache.size is 20000, the
> HistoryFileManager.addIfAbsent method produces 50000 - 20000 = 30000 lines
> of the message "Waiting to remove <key> from JobListCache because it is not
> in done yet".
> Stack traces are attached.
> Impact:
> In addition to large disk consumption, this issue blocks JobHistory.getJob
> for a long time and slows job execution down significantly, because getJob
> is called by RPCs such as
> HistoryClientService.HSClientProtocolHandler.getJobReport.
> This happens because HistoryFileManager.UserLogDir.scanIfNeeded eventually
> calls HistoryFileManager.addIfAbsent in a synchronized block. When multiple
> threads call scanIfNeeded simultaneously, one of them acquires the lock and
> the other threads are blocked until the first thread completes the
> long-running HistoryFileManager.addIfAbsent call.
> Solution:
> * Reduce the amount of logs so that HistoryFileManager.addIfAbsent doesn't
> take too long.
> * Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded
> skips scanning if another thread is already scanning. This changes the
> semantics of some HistoryFileManager methods (such as getAllFileInfo and
> getFileInfo) because scanIfNeeded would keep outdated state.
> * Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls
> are not blocked by a loop at a scale of tens of thousands.
>
> This patch implements the first item.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
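
For illustration only, here is a minimal Java sketch of the first Solution item (fewer log lines during eviction). It is not the attached patch and not Hadoop code: BoundedCacheSketch, Entry, movePending, and warnLinesEmitted are hypothetical names, and System.err stands in for the real logger. The idea shown is to count the entries that cannot be evicted yet and emit one summary line per call instead of one line per entry.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical stand-in for a size-bounded job list cache; not Hadoop code. */
final class BoundedCacheSketch {

    /** Hypothetical entry; movePending mimics "not in done yet". */
    static final class Entry {
        final boolean movePending;
        Entry(boolean movePending) { this.movePending = movePending; }
    }

    private final ConcurrentSkipListMap<Integer, Entry> cache =
            new ConcurrentSkipListMap<>();
    private final int maxSize;

    /** Number of warning lines written so far (kept only for the demo). */
    int warnLinesEmitted = 0;

    BoundedCacheSketch(int maxSize) { this.maxSize = maxSize; }

    Entry addIfAbsent(int key, Entry value) {
        Entry old = cache.putIfAbsent(key, value);
        if (cache.size() > maxSize) {
            int waiting = 0;             // entries that cannot be evicted yet
            Integer firstWaiting = null; // one sample key for the summary line
            Iterator<Integer> keys = cache.navigableKeySet().iterator();
            while (cache.size() > maxSize && keys.hasNext()) {
                Integer k = keys.next();
                Entry e = cache.get(k);
                if (e == null) {
                    continue;            // removed concurrently
                }
                if (e.movePending) {
                    if (firstWaiting == null) {
                        firstWaiting = k;
                    }
                    waiting++;           // count instead of logging per entry
                } else {
                    cache.remove(k);     // evictable: drop the oldest key
                }
            }
            if (waiting > 0) {
                // One summary line per call, however many entries are waiting.
                warnLinesEmitted++;
                System.err.println("Waiting to remove " + waiting
                        + " entries (e.g. " + firstWaiting
                        + ") because they are not in done yet.");
            }
        }
        return old;
    }

    int size() { return cache.size(); }

    public static void main(String[] args) {
        BoundedCacheSketch c = new BoundedCacheSketch(2);
        for (int i = 0; i < 5; i++) {
            c.addIfAbsent(i, new Entry(true));
        }
        System.out.println("size=" + c.size()
                + " warnLines=" + c.warnLinesEmitted);
    }
}
```

In this sketch, with a limit of 2 and five entries that are all still pending, the three over-limit calls emit three summary lines in total, where per-entry logging would have emitted 3 + 4 + 5 = 12. The loop still walks every pending entry, so this only removes the logging cost, not the scan itself.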