[ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205437#comment-14205437 ]
Hadoop QA commented on HADOOP-11238:
------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12680662/HADOOP-11238.patch
  against trunk revision eace218.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common:

                  org.apache.hadoop.security.ssl.TestReloadingX509TrustManager

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5060//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5060//console

This message is automatically generated.

> Group cache expiry causes namenode slowdown
> -------------------------------------------
>
>                 Key: HADOOP-11238
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11238
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>            Reporter: Chris Li
>            Assignee: Chris Li
>            Priority: Minor
>         Attachments: HADOOP-11238.patch
>
>
> Our namenode pauses for 12-60 seconds several times every hour. During these
> pauses, no new requests can come in.
> Around the time of pauses, we have log messages such as:
> 2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential
> performance problem: getGroups(user=xxxxx) took 34507 milliseconds.
> The current theory is:
> 1. Groups has a cache that is refreshed periodically. Each entry has a cache
> expiry.
> 2. When a cache entry expires, multiple threads can see this expiration, and
> we then get a thundering herd effect where all these threads hit the wire
> and overwhelm our LDAP servers. (We are using ShellBasedUnixGroupsMapping with
> sssd; how this happens has yet to be established.)
> 3. Group resolution queries begin to take longer. I've observed it taking 1.2
> seconds instead of the usual 0.01-0.03 seconds when measuring in the shell
> with `time groups myself`.
> 4. If there is mutual exclusion somewhere along this path, a 1 second pause
> could lead to a 60 second pause as all the threads compete for the resource.
> The exact cause hasn't been established.
> Potential solutions include:
> 1. Increasing the group cache time, which will make the issue less frequent
> 2. Rolling evictions of the cache, so we prevent the large spike in LDAP
> queries
> 3. Gating the cache refresh so that only one thread is responsible for
> refreshing the cache



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
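
For illustration only, the third potential solution in the description (gating the refresh so that a single thread reloads an expired entry) could be sketched roughly as follows. This is not the attached patch and not the actual org.apache.hadoop.security.Groups code. The class name GatedGroupCache, its constructor, and the loader function are hypothetical; the loader stands in for whatever group mapping is configured.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a gated group cache; NOT the Hadoop Groups class. */
class GatedGroupCache {

    /** One cache slot: a future for the groups plus the time the slot expires. */
    private static final class Entry {
        final CompletableFuture<List<String>> groups = new CompletableFuture<>();
        // Stays "never expired" while the load is in flight, so late arrivals wait on it.
        volatile long expiresAtMillis = Long.MAX_VALUE;
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader; // assumed: the slow lookup (shell, LDAP, ...)
    private final long ttlMillis;

    GatedGroupCache(Function<String, List<String>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    List<String> getGroups(String user) {
        while (true) {
            Entry current = cache.get(user);
            if (current != null && System.currentTimeMillis() < current.expiresAtMillis) {
                // Fresh, or currently being loaded by another thread: wait for that result.
                return current.groups.join();
            }
            // Missing or expired: try to become the single loader for this user.
            Entry fresh = new Entry();
            boolean won = (current == null)
                    ? cache.putIfAbsent(user, fresh) == null
                    : cache.replace(user, current, fresh);
            if (!won) {
                continue; // another thread installed its entry first; retry and wait on it
            }
            try {
                List<String> result = loader.apply(user);
                fresh.expiresAtMillis = System.currentTimeMillis() + ttlMillis;
                fresh.groups.complete(result);
                return result;
            } catch (RuntimeException e) {
                cache.remove(user, fresh);             // let a later call retry the load
                fresh.groups.completeExceptionally(e); // wake up the waiting threads
                throw e;
            }
        }
    }
}
```

With this shape, N threads that see the same expired entry produce one backend lookup instead of N. The waiting threads still block for the duration of that one lookup; a variant that serves the stale value while the reload runs would trade freshness for latency.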