[ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Li updated HADOOP-11238:
------------------------------
    Status: Patch Available  (was: Open)

> Group cache expiry causes namenode slowdown
> -------------------------------------------
>
>                 Key: HADOOP-11238
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11238
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>            Reporter: Chris Li
>            Assignee: Chris Li
>            Priority: Minor
>         Attachments: HADOOP-11238.patch
>
>
> Our namenode pauses for 12-60 seconds several times every hour. During these
> pauses, no new requests can come in.
>
> Around the time of the pauses, we see log messages such as:
>
> 2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential
> performance problem: getGroups(user=xxxxx) took 34507 milliseconds.
>
> The current theory is:
>
> 1. Groups has a cache that is refreshed periodically. Each entry has a cache
> expiry.
> 2. When a cache entry expires, multiple threads can see this expiration, and
> then we have a thundering herd effect where all these threads hit the wire
> and overwhelm our LDAP servers. (We are using ShellBasedUnixGroupsMapping
> with sssd; how this happens has yet to be established.)
> 3. Group resolution queries begin to take longer. I've observed them taking
> 1.2 seconds instead of the usual 0.01-0.03 seconds when measuring in the
> shell with `time groups myself`.
> 4. If there is mutual exclusion somewhere along this path, a 1 second pause
> could lead to a 60 second pause as all the threads compete for the resource.
> The exact cause hasn't been established.
>
> Potential solutions include:
>
> 1. Increasing the group cache time, which will make the issue less frequent
> 2. Rolling evictions of the cache, so that we prevent the large spike in LDAP
> queries
> 3. Gating the cache refresh so that only one thread is responsible for
> refreshing the cache

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
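
For illustration, the third potential solution (gating the refresh so that only one thread reloads an expired entry) could look roughly like the sketch below. This is not the attached patch and not Hadoop's Groups implementation; it is written in Python for brevity, and the names GatedGroupCache, loader, and ttl_seconds are hypothetical. The loader stands in for whatever slow lookup resolves a user's groups (for example a shell call backed by sssd/LDAP).

```python
import threading
import time


class GatedGroupCache:
    """Illustrative cache: at most one thread refreshes a given user's entry."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader          # hypothetical: user -> list of group names
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()  # guards the two dicts below
        self._entries = {}             # user -> (groups, expiry time)
        self._inflight = {}            # user -> Event set when the refresh ends

    def get_groups(self, user):
        while True:
            with self._lock:
                entry = self._entries.get(user)
                if entry is not None and self._clock() < entry[1]:
                    return entry[0]    # fresh cache hit, no lookup needed
                done = self._inflight.get(user)
                is_refresher = done is None
                if is_refresher:
                    # No refresh in progress: this thread takes the job.
                    done = threading.Event()
                    self._inflight[user] = done
            if not is_refresher:
                # Another thread is already refreshing; wait, then re-check.
                done.wait()
                continue
            try:
                # The slow lookup runs outside the lock, so other users'
                # cache hits are not blocked while it is in progress.
                groups = self._loader(user)
                with self._lock:
                    self._entries[user] = (groups, self._clock() + self._ttl)
                return groups
            finally:
                with self._lock:
                    del self._inflight[user]
                done.set()             # wake every waiting thread
```

With this shape, eight threads asking for the same expired user trigger one call to the loader rather than eight; the other seven block until the result is cached. Waiters still stall for the duration of that one lookup, so a variant that serves the stale entry while the refresh runs in the background would be needed to remove the pause entirely.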