[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

2014-11-10 Thread Chris Li (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
--
Status: Patch Available  (was: Open)

 Group cache expiry causes namenode slowdown
 ---

 Key: HADOOP-11238
 URL: https://issues.apache.org/jira/browse/HADOOP-11238
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 2.5.1
Reporter: Chris Li
Assignee: Chris Li
Priority: Minor
 Attachments: HADOOP-11238.patch


 Our namenode pauses for 12-60 seconds several times every hour. During these 
 pauses, no new requests can come in.
 Around the time of pauses, we have log messages such as:
 2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
 performance problem: getGroups(user=x) took 34507 milliseconds.
 The current theory is:
 1. Groups has a cache that is refreshed periodically. Each entry has a cache 
 expiry.
 2. When a cache entry expires, multiple threads can see this expiration, and 
 then we have a thundering herd effect where all these threads hit the wire 
 and overwhelm our LDAP servers (we are using ShellBasedUnixGroupsMapping 
 with sssd; how this happens has yet to be established).
 3. Group resolution queries begin to take longer; I've observed them taking 
 1.2 seconds instead of the usual 0.01-0.03 seconds when measuring in the 
 shell with `time groups myself`.
 4. If there is mutual exclusion somewhere along this path, a 1-second pause 
 could lead to a 60-second pause as all the threads compete for the resource. 
 The exact cause hasn't been established.
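 The expiry race in steps 1-2 can be reproduced with a minimal sketch. The
 class, TTL, and sleep below are illustrative, not the actual
 org.apache.hadoop.security.Groups internals:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Naive expiring cache: every caller that observes an expired entry refetches,
// so concurrent callers all hit the backend at once (the thundering herd).
public class HerdDemo {
    private volatile long loadedAt = 0;          // never loaded => expired
    private final AtomicInteger backendCalls = new AtomicInteger();

    private void getGroups() throws InterruptedException {
        if (System.currentTimeMillis() - loadedAt > 60_000) {  // TTL check
            backendCalls.incrementAndGet();      // stands in for the LDAP/shell call
            Thread.sleep(100);                   // a "slow" backend lookup
            loadedAt = System.currentTimeMillis();
        }
    }

    /** Releases `threads` callers at once; returns how many hit the backend. */
    static int run(int threads) throws Exception {
        HerdDemo cache = new HerdDemo();
        CountDownLatch ready = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> { ready.await(); cache.getGroups(); return null; });
        }
        ready.countDown();                       // release all callers together
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return cache.backendCalls.get();
    }

    public static void main(String[] args) throws Exception {
        // Typically prints 8: every waiting caller saw the expired entry.
        System.out.println("backend calls: " + run(8));
    }
}
```

 Because all eight callers pass the expiry check before the first fetch
 completes, they all query the backend, which is the spike described above.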
 Potential solutions include:
 1. Increasing the group cache time, which will make the issue less frequent
 2. Rolling evictions of the cache, so that we prevent the large spike in 
 LDAP queries
 3. Gating the cache refresh so that only one thread is responsible for 
 refreshing the cache
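 Option 3 can be sketched as a single-flight refresh, assuming it is
 acceptable to serve the stale value while one thread reloads. The class and
 names below are hypothetical, not the eventual patch:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Gate the refresh with a compare-and-set so at most one caller reloads an
// expired entry; all other callers keep serving the stale value.
public class GatedGroupCache {
    private volatile List<String> groups = List.of();   // possibly stale value
    private volatile long loadedAt = 0;
    private final long ttlMs;
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    final AtomicInteger backendCalls = new AtomicInteger();

    GatedGroupCache(long ttlMs) { this.ttlMs = ttlMs; }

    List<String> getGroups() {
        boolean expired = System.currentTimeMillis() - loadedAt > ttlMs;
        if (expired && refreshing.compareAndSet(false, true)) {
            try {                                // only the CAS winner reloads
                backendCalls.incrementAndGet();  // stands in for the LDAP call
                groups = List.of("hadoop", "users");
                loadedAt = System.currentTimeMillis();
            } finally {
                refreshing.set(false);
            }
        }
        return groups;  // losers (and fresh hits) return without touching LDAP
    }
}
```

 Eight concurrent callers on an expired entry would produce one backend call
 instead of eight. A production version would also need to handle the
 cold-start case, where there is no stale value to serve while the winner
 loads.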



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

2014-11-10 Thread Chris Li (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
--
Attachment: HADOOP-11238.patch

Uploading patch



[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

2014-10-28 Thread Chris Li (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
--
Description: 
Our namenode pauses for 12-60 seconds several times every hour or so. During 
these pauses, no new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. Each entry has a cache 
expiry.
2. When a cache entry expires, multiple threads can see this expiration and 
then we have a thundering herd effect where all these threads hit the wire and 
overwhelm our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, 
how this happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries
3. Gate the cache refresh so that only one thread is responsible for refreshing 
the cache



  was:
Our namenode pauses for 12-60 seconds several times every hour or so. During 
these pauses, no new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. 
2. When the cache is cleared, we have a thundering herd effect which overwhelms 
our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, how this 
happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries






[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

2014-10-28 Thread Chris Li (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
--
Description: 
Our namenode pauses for 12-60 seconds several times every hour. During these 
pauses, no new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. Each entry has a cache 
expiry.
2. When a cache entry expires, multiple threads can see this expiration and 
then we have a thundering herd effect where all these threads hit the wire and 
overwhelm our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, 
how this happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries
3. Gate the cache refresh so that only one thread is responsible for refreshing 
the cache



  was:
Our namenode pauses for 12-60 seconds several times every hour or so. During 
these pauses, no new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. Each entry has a cache 
expiry.
2. When a cache entry expires, multiple threads can see this expiration and 
then we have a thundering herd effect where all these threads hit the wire and 
overwhelm our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, 
how this happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries
3. Gate the cache refresh so that only one thread is responsible for refreshing 
the cache






[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

2014-10-27 Thread Chris Li (JIRA)

 [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
--
Description: 
Our namenode pauses for 12-60 seconds several times every hour or so. During 
these pauses, no new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. 
2. When the cache is cleared, we have a thundering herd effect which overwhelms 
our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, how this 
happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries



  was:
Our namenode pauses for 12-60 seconds every hour or so. During these pauses, no 
new requests can come in.

Around the time of pauses, we have log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential 
performance problem: getGroups(user=x) took 34507 milliseconds.

The current theory is:
1. Groups has a cache that is refreshed periodically. 
2. When the cache is cleared, we have a thundering herd effect which overwhelms 
our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd, how this 
happens has yet to be established)
3. group resolution queries begin to take longer, I've observed it taking 1.2 
seconds instead of the usual 0.01-0.03 seconds when measuring in the shell 
`time groups myself`
4. If there is mutual exclusion somewhere along this path, a 1 second pause 
could lead to a 60 second pause as all the threads compete for the resource. 
The exact cause hasn't been established

Potential solutions include:
1. Increasing group cache time, which will make the issue less frequent
2. Rolling evictions of the cache so we prevent the large spike in LDAP queries



