Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-17 Thread Andrew E. Bruno
On Thu, Jul 16, 2015 at 10:32:59AM +0200, Ludwig Krispenz wrote:
 Thank you for the data, I think I understand now what is going on.
 
 In the error logs we see only messages like (from my test env):
 
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): replay_update: modifys operation (dn=dc=example,dc=com
 csn=55a82a2900010064) not sent - empty
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): replay_update: Consumer successfully sent operation with
 csn 55a82a2900010064
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): Skipping update operation with no message_id (uniqueid
 7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a2900010064):
 
 This happens if fractional replication is configured, as IPA does, and the
 changes affect only attributes which will NOT be replicated. The local
 RUV will be updated, but since no change is actually sent, the consumer RUV
 is not updated, and replication will always start from a very old
 csn. It is a rare scenario where a server receives only mods which are not
 replicated.
 
 I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225
 
 As a workaround, can you try to apply a mod on m14-26 which will not be
 stripped: either create a dummy user or add a description attribute to an
 existing object. Replication will once again iterate through all the changes
 (which can take a while), but should then replay this latest change and
 define a new offset.
 

Excellent. I can confirm your workaround fixed the issue. I updated a
user's email address (on m14-26) and the load came back down to normal
within a few minutes.

Thanks very much for all your help debugging this.

Best,

--Andrew



 Regards,
 Ludwig
 
 

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-16 Thread Ludwig Krispenz

Thank you for the data, I think I understand now what is going on.

In the error logs we see only messages like (from my test env):

[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 
(localhost:9759): replay_update: modifys operation 
(dn=dc=example,dc=com csn=55a82a2900010064) not sent - empty
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 
(localhost:9759): replay_update: Consumer successfully sent operation 
with csn 55a82a2900010064
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 
(localhost:9759): Skipping update operation with no message_id (uniqueid 
7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a2900010064):


This happens if fractional replication is configured, as IPA does, and the 
changes affect only attributes which will NOT be replicated. The local 
RUV will be updated, but since no change is actually sent, the consumer 
RUV is not updated, and replication will always start from a very old 
csn. It is a rare scenario where a server receives only mods which are 
not replicated.
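
You can check which attributes the agreements strip by reading the
fractional replication list from the agreement entries; a minimal sketch
(directory manager bind assumed; nsDS5ReplicatedAttributeList is the
standard 389-ds attribute holding the excluded list):

ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b cn=config "objectclass=nsds5replicationagreement" nsDS5ReplicatedAttributeList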


I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225

As a workaround, can you try to apply a mod on m14-26 which will not be 
stripped: either create a dummy user or add a description attribute to 
an existing object. Replication will once again iterate through all the 
changes (which can take a while), but should then replay this latest 
change and define a new offset.
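
For example, a minimal sketch of such a mod (the entry DN below is a
hypothetical placeholder; any existing, replicated object in your tree
will do):

ldapmodify -x -D "cn=directory manager" -W
# placeholder DN - substitute a real entry from your suffix
dn: uid=someuser,cn=users,cn=accounts,dc=example,dc=com
changetype: modify
replace: description
description: dummy mod to advance the consumer RUV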


Regards,
Ludwig


On 07/15/2015 07:05 PM, Andrew E. Bruno wrote:

On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:

On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:

On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:

On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:

On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:

hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and
Jul 7th - so it looks like it is iterating the changelog over and over
again.
The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
?

can you provide the result of the following search from
m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
objectclass=nsds5replica nsds50ruv

master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

so this is really strange, the master m14-24 has the latest change from
replica 5 (m14-26) as: 5591a3d200070005
which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
so no update from replica 5 has arrived at m14-24 since then, or it could not
update the RUV. So m14-26 tries to replicate all the changes back from that
time, but looks like it has no success.
is there anything in the logs of m14-24 ? can you see successful mods with
csn=xxx0005 ?

Here's what I could find from the logs on srv-m14-24:


[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT 
err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054

ok, so no update originating at replica 5 has been replicated (probably
since June 29). Did you experience data inconsistency between the servers ?


And here's the last few lines of the error log on srv-m14-24:

one set of messages refers to the o=ipaca backend and seems to be transient;
replication continues later.
the other set of messages, "No original tombstone ...", is annoying (and it
is fixed in ticket https://fedorahosted.org/389/ticket/47912)

the next thing we can do to try to understand what is going on is to enable
replication logging on m14-26. It will then not only consume all cpu, but
also write tons of messages to the error log.
But it can be turned on and off:

ldapmodify ...
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

and let it run for a while, then set it back to: 0

I enabled replication logging and it's running now. I noticed the
default value for nsslapd-errorlog-level was set to 16384 (not 0).

OK to send you the logs off list? Looks like they contain quite a bit of
sensitive data.

Thanks again for all the help looking into this.

Best,

--Andrew




[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Andrew E. Bruno
On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:
 
 On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
 On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
 On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th,
 and Jul 7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the
 master ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv
 master is srv-m14-24.. here's the results of the ldapsearch:
 
 [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W
  -b cn=config objectclass=nsds5replica nsds50ruv
 
 # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping 
 tree,cn=config
 nsds50ruv: {replicageneration} 5527f7110004
 nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
 5527f7710004 55a55aed0014
 nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
 5537c7730005 5591a3d200070005
 nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
 55943dda0006 5594537800020006
 so this is really strange, the master m14-24 has the latest change from
 replica 5 (m14-26) as: 5591a3d200070005
 which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
 so no update from replica 5 has arrived at m14-24 since then, or it could not
 update the RUV. So m14-26 tries to replicate all the changes back from that
 time, but looks like it has no success.
 is there anything in the logs of m14-24 ? can you see successful mods with
 csn=xxx0005 ?
 Here's what I could find from the logs on srv-m14-24:
 
 
 [srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
 access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 
 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054
 ok, so no update originating at replica 5 has been replicated (probably
 since June 29). Did you experience data inconsistency between the servers ?
 
 
 And here's the last few lines of the error log on srv-m14-24:
 one set of messages refers to the o=ipaca backend and seems to be transient;
 replication continues later.
 the other set of messages, "No original tombstone ...", is annoying (and it
 is fixed in ticket https://fedorahosted.org/389/ticket/47912)
 
 the next thing we can do to try to understand what is going on is to enable
 replication logging on m14-26. It will then not only consume all cpu, but
 also write tons of messages to the error log.
 But it can be turned on and off:
 
 ldapmodify ...
 dn: cn=config
 changetype: modify
 replace: nsslapd-errorlog-level
 nsslapd-errorlog-level: 8192
 
 and let it run for a while, then set it back to: 0

I enabled replication logging and it's running now. I noticed the
default value for nsslapd-errorlog-level was set to 16384 (not 0). 

OK to send you the logs off list? Looks like they contain quite a bit of
sensitive data.

Thanks again for all the help looking into this.

Best,

--Andrew



 
 
 [12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
 original_tombstone for changenumber=2456070,cn=changelog!!
 [12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
 original_tombstone for changenumber=2498441,cn=changelog!!
 [13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations 
 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Andrew E. Bruno
On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
 
 On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and
 Jul 7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
 ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv
 
 master is srv-m14-24.. here's the results of the ldapsearch:
 
 [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W
 -b cn=config objectclass=nsds5replica nsds50ruv
 
 # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping 
 tree,cn=config
 nsds50ruv: {replicageneration} 5527f7110004
 nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
 5527f7710004 55a55aed0014
 nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
 5537c7730005 5591a3d200070005
 nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
 55943dda0006 5594537800020006
 so this is really strange, the master m14-24 has the latest change from
 replica 5 (m14-26) as: 5591a3d200070005
 which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
 so no update from replica 5 has arrived at m14-24 since then, or it could not
 update the RUV. So m14-26 tries to replicate all the changes back from that
 time, but looks like it has no success.
 is there anything in the logs of m14-24 ? can you see successful mods with
 csn=xxx0005 ?

Here's what I could find from the logs on srv-m14-24:


[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT 
err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054


And here's the last few lines the error log on srv-m14-24:

[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2498441,cn=changelog!!
[13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations 
error (1). Will retry later.
[13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations 
error (1). Will retry later.
[13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations 
error (1). Will retry later.
[13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations 
error (1). Will retry later.
[13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations 
error (1). Will retry later.
[13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a0060): Operations 
error (1). Will retry later.
[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad60060): Operations 
error (1). Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f60060): Operations 
error (1). Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa0060): Operations 
error (1). Will retry later.
[14/Jul/2015:09:56:52 -0400] 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Ludwig Krispenz


On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:

On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:

hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and
Jul 7th - so it looks like it is iterating the changelog over and over
again.
The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
?

can you provide the result of the following search from
m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
objectclass=nsds5replica nsds50ruv


master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006
so this is really strange, the master m14-24 has the latest change from 
replica 5 (m14-26) as: 5591a3d200070005
which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
so no update from replica 5 has arrived at m14-24 since then, or it could 
not update the RUV. So m14-26 tries to replicate all the changes back from 
that time, but looks like it has no success.
is there anything in the logs of m14-24 ? can you see successful mods 
with csn=xxx0005 ?
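
The first eight hex digits of a CSN encode the Unix timestamp of the
originating change, so the correspondence above can be checked from a
shell; a quick sketch using GNU date and bash hex arithmetic:

$ date -u -d @$((16#5591a3d2))
Mon Jun 29 20:00:18 UTC 2015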


# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b


server with high cpu load is srv-m14-26. here's the results of the ldapsearch
from this server:

[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 55a55b4700030005
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a53eba0004
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056


srv-m14-25-02 is our 3rd replica, which we recently added back in after it
failed (it was added back in on 7/1).

Let me know if you need anything else. Thanks for the help.

--Andrew


On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:

On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:

On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:

On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the
thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H
shows this thread (2879) with high cpu usage:

2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd

this thread is a replication thread sending updates, what is strange is that
the current csn_str is quite old (July 7th), I can't tell which agreement
this thread is handling, but it looks like it is heavily reading the changelog
and sending updates. anything changed recently in replication setup ?

Yes, we had one replica fail (on 6/19), which we removed (not this one
showing high CPU load). Had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently saw this spike (past
24 hours or so).

is it still in this state, or was it a spike?

Yes same state.


if it is still consuming high cpu, could you
- get a few 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Ludwig Krispenz


On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:

On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:

On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:

On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:

hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and
Jul 7th - so it looks like it is iterating the changelog over and over
again.
The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
?

can you provide the result of the following search from
m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
objectclass=nsds5replica nsds50ruv

master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

so this is really strange, the master m14-24 has the latest change from
replica 5 (m14-26) as: 5591a3d200070005
which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
so no update from replica 5 has arrived at m14-24 since then, or it could not
update the RUV. So m14-26 tries to replicate all the changes back from that
time, but looks like it has no success.
is there anything in the logs of m14-24 ? can you see successful mods with
csn=xxx0005 ?

Here's what I could find from the logs on srv-m14-24:


[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT 
err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054
ok, so no update originating at replica 5 has been replicated (probably 
since June 29). Did you experience data inconsistency between the servers ?



And here's the last few lines of the error log on srv-m14-24:
one set of messages refers to the o=ipaca backend and seems to be 
transient; replication continues later.
the other set of messages, "No original tombstone ...", is annoying (and 
it is fixed in ticket https://fedorahosted.org/389/ticket/47912)


the next thing we can do to try to understand what is going on is to 
enable replication logging on m14-26. It will then not only consume all 
cpu, but also write tons of messages to the error log.

But it can be turned on and off:

ldapmodify ...
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

and let it run for a while, then set it back to: 0
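
A sketch of the matching ldapmodify to turn it back off afterwards (note
the level before enabling may not be 0; Andrew reports it was 16384 on
these servers, so record the current value first):

ldapmodify -x -D "cn=directory manager" -W
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 0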



[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2498441,cn=changelog!!
[13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations error (1). 
Will retry later.
[13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations error (1). 
Will retry later.
[13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations error (1). 
Will retry later.
[13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations error (1). 
Will retry later.
[13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations error (1). 
Will retry later.
[13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a0060): Operations error (1). 
Will retry later.
[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-14 Thread Ludwig Krispenz


On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:

On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the
thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H
shows this thread (2879) with high cpu usage:

2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd

this thread is a replication thread sending updates, what is strange is that
the current csn_str is quite old (July 7th), I can't tell which agreement
this thread is handling, but it looks like it is heavily reading the changelog
and sending updates. anything changed recently in replication setup ?


Yes, we had one replica fail (on 6/19), which we removed (not this one
showing high CPU load). Had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently saw this spike (past
24 hours or so).

is it still in this state, or was it a spike?

if it is still consuming high cpu, could you
- get a few pstacks like the one before, with some time in between; I 
would like to see if it is progressing through the csns or looping on the 
same one
- check the consumer side. is there anything in the error log ? does the 
access log show replication activity from this server ?

- eventually enable replication logging: nsslapd-errorlog-level: 8192
  


On a side note, we get hit with this bug often:

https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html

(a rogue sssd_be process hammering a replica).

This causes high ns-slapd utilization on the replica and restarting sssd
on the client host immediately fixes the issue. However, in this
case, we're not seeing this behavior.










On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 FreeIPA replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16.

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low
cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:

Start of Logs:13/Jul/2015:04:49:18
End of Logs:  13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts: 0
Total Connections:2343
  - LDAP Connections:  2120
  - LDAPI Connections: 223
  - LDAPS Connections: 0
  - StartTLS Extended Ops: 45
  Secure Protocol Versions:
   - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations: 111865
Total Results:111034
Overall Performance:  99.3%

Searches: 95585 (5.03/sec)  (301.64/min)
Modifications:3369  (0.18/sec)  (10.63/min)
Adds: 0 (0.00/sec)  (0.00/min)
Deletes:  0 (0.00/sec)  (0.00/min)
Mod RDNs: 0 (0.00/sec)  (0.00/min)
Compares: 0 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-14 Thread Ludwig Krispenz
hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, 
and Jul 7th - so it looks like it is iterating the changelog over and 
over again.
The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the 
master ?


can you provide the result of the following search from 
m14-24.ccr.buffalo.edu and the server with the high cpu:


ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config 
objectclass=nsds5replica nsds50ruv


On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:

On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:

On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:

On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the
thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H
shows this thread (2879) with high cpu usage:

2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd

this thread is a replication thread sending updates, what is strange is that
the current csn_str is quite old (July 7th), I can't tell which agreement
this thread is handling, but it looks like it is heavily reading the changelog
and sending updates. anything changed recently in replication setup ?

Yes, we had one replica fail (on 6/19), which we removed (not this one
showing high CPU load). Had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently saw this spike (past
24 hours or so).

is it still in this state, or was it a spike?

Yes same state.


if it is still consuming high cpu, could you
- get a few pstacks like the one before, with some time in between; I would
like to see if it is progressing through the csns or looping on the same one

Attached are a few stacktraces. The thread pegging the cpu is:

 PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2879 dirsrv   20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26 ns-slapd


- check the consumer side. is there anything in the error log ? does the
access log show replication activity from this server


Here are some errors showing up on the first master server rep1 (rep2 is the
server with the pegged cpu):

[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to 
replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad60060): 
Operations error (1). Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to 
replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f60060): 
Operations error (1). Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-rep2-tomcat (rep2:389): Consumer failed to replay 
change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa0060): 
Operations error (1). Will retry later.


Here are some snippets from the access log of rep2:


[14/Jul/2015:08:22:31 -0400] conn=87 op=9794 EXT oid=2.16.840.1.113730.3.5.5 
name=Netscape Replication End Session
[14/Jul/2015:08:22:31 -0400] conn=87 op=9794 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:22:31 -0400] conn=87 op=9795 EXT oid=2.16.840.1.113730.3.5.12 
name=replication-multimaster-extop
[14/Jul/2015:08:22:31 -0400] conn=87 op=9795 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:22:33 -0400] conn=87 op=9796 EXT oid=2.16.840.1.113730.3.5.5 
name=Netscape Replication End Session
..
[14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 
etime=0 csn=55a4ff6c0005
..
[14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 
nentries=1 etime=0
[14/Jul/2015:08:24:03 -0400] conn=87 op=9810 EXT oid=2.16.840.1.113730.3.5.5 
name=Netscape Replication End Session
[14/Jul/2015:08:24:03 -0400] conn=87 op=9810 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:03 -0400] conn=87 op=9811 EXT oid=2.16.840.1.113730.3.5.12 
name=replication-multimaster-extop
[14/Jul/2015:08:24:03 -0400] conn=87 op=9811 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:05 -0400] conn=87 op=9812 EXT oid=2.16.840.1.113730.3.5.5 
name=Netscape Replication End Session
[14/Jul/2015:08:24:05 -0400] conn=87 op=9812 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:08 -0400] conn=87 op=9813 EXT oid=2.16.840.1.113730.3.5.12 
name=replication-multimaster-extop
[14/Jul/2015:08:24:08 -0400] conn=87 op=9813 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:10 -0400] conn=87 op=9814 EXT oid=2.16.840.1.113730.3.5.5 
name=Netscape Replication End Session

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-14 Thread Andrew E. Bruno
On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and
 Jul 7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
 ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv


master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b


server with high cpu load is srv-m14-26. here's the results of the ldapsearch
from this server:

[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 55a55b4700030005
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a53eba0004
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056


srv-m14-25-02 is our 3rd replica, which we recently added back in after it
failed (it was added back in on 7/1).

Let me know if you need anything else. Thanks for the help.

--Andrew

 
 On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
 On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
 On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
 can you get a pstack of the slapd process along with a top -H to find the
 thread with high cpu usage
 Attached is the full stacktrace of the running ns-slapd process. top -H
 shows this thread (2879) with high cpu usage:
 
 2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
 this thread is a replication thread sending updates, what is strange is
 that the current csn_str is quite old (July 7th), I can't tell which
 agreement this thread is handling, but it looks like it is heavily reading
 the changelog and sending updates. anything changed recently in
 replication setup ?
 Yes, we had one replica fail (on 6/19), which we removed (not this one
 showing high CPU load). Had to perform some manual cleanup of the ipa-ca
 RUVs. Then we added the replica back in on 7/1. Since then, replication
 appears to have been running normally between the 3 replicas. We've been
 monitoring utilization since 7/1 and only recently saw this spike (past
 24 hours or so).
 is it still in this state, or was it a spike?
 Yes same state.
 
 if it is still consuming high cpu, could you
 - get a few pstacks like the one before, with some time in between; I would
 like to see if it is progressing through the csns or looping on the same one
 Attached are a few stacktraces. The thread pegging the cpu is:
 
  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2879 dirsrv   20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26 ns-slapd
 
 - check the consumer side. is there anything in the error log ? does the
 access log show 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Ludwig Krispenz
can you get a pstack of the slapd process along with a top -H to find the 
thread with high cpu usage
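
A minimal sketch of gathering both, assuming a single ns-slapd instance
and the pstack utility (shipped with gdb) installed:

# one batch iteration of per-thread cpu usage for ns-slapd
top -b -H -n 1 -p $(pidof ns-slapd) | head -40
# stack traces of all threads, saved so repeated runs can be compared
pstack $(pidof ns-slapd) > /tmp/ns-slapd.stacks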


Ludwig

On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 FreeIPA replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16.

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low
cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:

Start of Logs:13/Jul/2015:04:49:18
End of Logs:  13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts: 0
Total Connections:2343
  - LDAP Connections:  2120
  - LDAPI Connections: 223
  - LDAPS Connections: 0
  - StartTLS Extended Ops: 45
  Secure Protocol Versions:
   - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations: 111865
Total Results:111034
Overall Performance:  99.3%

Searches: 95585 (5.03/sec)  (301.64/min)
Modifications:3369  (0.18/sec)  (10.63/min)
Adds: 0 (0.00/sec)  (0.00/min)
Deletes:  0 (0.00/sec)  (0.00/min)
Mod RDNs: 0 (0.00/sec)  (0.00/min)
Compares: 0 (0.00/sec)  (0.00/min)
Binds:7082  (0.37/sec)  (22.35/min)

Proxied Auth Operations:  0
Persistent Searches:  0
Internal Operations:  0
Entry Operations: 0
Extended Operations:  5317
Abandoned Requests:   416
Smart Referrals Received: 0

VLV Operations:   96
VLV Unindexed Searches:   0
VLV Unindexed Components: 32
SORT Operations:  64

Entire Search Base Queries:   0
Paged Searches:   3882
Unindexed Searches:   0
Unindexed Components: 5

FDs Taken:2566
FDs Returned: 2643
Highest FD Taken: 249

Broken Pipes: 0
Connections Reset By Peer:0
Resource Unavailable: 0
Max BER Size Exceeded:0

Binds:7082
Unbinds:  2443
  - LDAP v2 Binds: 0
  - LDAP v3 Binds: 6859
  - AUTOBINDs: 223
  - SSL Client Binds:  0
  - Failed SSL Client Binds:   0
  - SASL Binds:6814
 GSSAPI - 6591
 EXTERNAL - 223
  - Directory Manager Binds:   0
  - Anonymous Binds:   6591
  - Other Binds:   491




strace timing on the ns-slapd process:


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.40    0.346659        5977        58           poll
  4.10    0.015057       15057         1           restart_syscall
  0.91    0.003353          57        59        59 getpeername
  0.49    0.001796         150        12           futex
  0.10    0.000364          73         5           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.367229                    135        59 total


top output (with threads 'H'):

   PID USER  

[Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Andrew E. Bruno
We have 3 FreeIPA replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16.

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error. 
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error. 
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low
cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:

Start of Logs:13/Jul/2015:04:49:18
End of Logs:  13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts: 0
Total Connections:2343
 - LDAP Connections:  2120
 - LDAPI Connections: 223
 - LDAPS Connections: 0
 - StartTLS Extended Ops: 45
 Secure Protocol Versions:
  - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations: 111865
Total Results:111034
Overall Performance:  99.3%

Searches: 95585 (5.03/sec)  (301.64/min)
Modifications:3369  (0.18/sec)  (10.63/min)
Adds: 0 (0.00/sec)  (0.00/min)
Deletes:  0 (0.00/sec)  (0.00/min)
Mod RDNs: 0 (0.00/sec)  (0.00/min)
Compares: 0 (0.00/sec)  (0.00/min)
Binds:7082  (0.37/sec)  (22.35/min)

Proxied Auth Operations:  0
Persistent Searches:  0
Internal Operations:  0
Entry Operations: 0
Extended Operations:  5317
Abandoned Requests:   416
Smart Referrals Received: 0

VLV Operations:   96
VLV Unindexed Searches:   0
VLV Unindexed Components: 32
SORT Operations:  64

Entire Search Base Queries:   0
Paged Searches:   3882
Unindexed Searches:   0
Unindexed Components: 5

FDs Taken:2566
FDs Returned: 2643
Highest FD Taken: 249

Broken Pipes: 0
Connections Reset By Peer:0
Resource Unavailable: 0
Max BER Size Exceeded:0

Binds:7082
Unbinds:  2443
 - LDAP v2 Binds: 0
 - LDAP v3 Binds: 6859
 - AUTOBINDs: 223
 - SSL Client Binds:  0
 - Failed SSL Client Binds:   0
 - SASL Binds:6814
GSSAPI - 6591
EXTERNAL - 223
 - Directory Manager Binds:   0
 - Anonymous Binds:   6591
 - Other Binds:   491




strace timing on the ns-slapd process:


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.40    0.346659        5977        58           poll
  4.10    0.015057       15057         1           restart_syscall
  0.91    0.003353          57        59        59 getpeername
  0.49    0.001796         150        12           futex
  0.10    0.000364          73         5           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.367229                    135        59 total


top output (with threads 'H'):

  PID USER     PR  NI    VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND


 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Ludwig Krispenz


On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:

On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the
thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H
shows this thread (2879) with high cpu usage:

2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
this thread is a replication thread sending updates, what is strange is 
that the current csn_str is quite old (July 7th), I can't tell which 
agreement this thread is handling, but it looks like it is heavily reading 
the changelog and sending updates. anything changed recently in 
replication setup ?







On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 FreeIPA replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16.

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low
cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:

Start of Logs:13/Jul/2015:04:49:18
End of Logs:  13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts: 0
Total Connections:2343
  - LDAP Connections:  2120
  - LDAPI Connections: 223
  - LDAPS Connections: 0
  - StartTLS Extended Ops: 45
  Secure Protocol Versions:
   - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations: 111865
Total Results:111034
Overall Performance:  99.3%

Searches: 95585 (5.03/sec)  (301.64/min)
Modifications:3369  (0.18/sec)  (10.63/min)
Adds: 0 (0.00/sec)  (0.00/min)
Deletes:  0 (0.00/sec)  (0.00/min)
Mod RDNs: 0 (0.00/sec)  (0.00/min)
Compares: 0 (0.00/sec)  (0.00/min)
Binds:7082  (0.37/sec)  (22.35/min)

Proxied Auth Operations:  0
Persistent Searches:  0
Internal Operations:  0
Entry Operations: 0
Extended Operations:  5317
Abandoned Requests:   416
Smart Referrals Received: 0

VLV Operations:   96
VLV Unindexed Searches:   0
VLV Unindexed Components: 32
SORT Operations:  64

Entire Search Base Queries:   0
Paged Searches:   3882
Unindexed Searches:   0
Unindexed Components: 5

FDs Taken:2566
FDs Returned: 2643
Highest FD Taken: 249

Broken Pipes: 0
Connections Reset By Peer:0
Resource Unavailable: 0
Max BER Size Exceeded:0

Binds:7082
Unbinds:  2443
  - LDAP v2 Binds: 0
  - LDAP v3 Binds: 6859
  - AUTOBINDs: 223
  - SSL Client Binds:  0
  - Failed SSL Client Binds:   0
  - SASL Binds:6814
 GSSAPI - 6591
 EXTERNAL - 223
  - Directory Manager Binds:   0
  - Anonymous Binds:   6591
  - Other Binds:   491




strace timing on the ns-slapd 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Andrew E. Bruno
On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
 
 On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
 can you get a pstack of the slapd process along with a top -H to find the
 thread with high cpu usage
 Attached is the full stacktrace of the running ns-slapd process. top -H
 shows this thread (2879) with high cpu usage:
 
 2879 dirsrv  20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
 this thread is a replication thread sending updates, what is strange is that
 the current csn_str is quite old (July 7th), I can't tell which agreement
 this thread is handling, but it looks like it is heavily reading the changelog
 and sending updates. anything changed recently in replication setup ?


Yes, we had one replica fail (on 6/19), which we removed (not this one
showing high CPU load). Had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently saw this spike (past
24 hours or so).

On a side note, we get hit with this bug often:

https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html

(a rogue sssd_be process hammering a replica).

This causes high ns-slapd utilization on the replica and restarting sssd
on the client host immediately fixes the issue. However, in this
case, we're not seeing this behavior.




 
 
 
 
 
 On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
 We have 3 FreeIPA replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and
 389-ds 1.3.3.1-16.
 
 Recently, the ns-slapd process on one of our replicas started showing 
 higher
 than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
 constantly.
 
 Seems very similar to this thread:
 https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html
 
 There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors 
 (not
 sure if these are related):
 
 
 [13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 [13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
 changenumber=2277387,cn=changelog from entryrdn index (-30993)
 [13/Jul/2015:04:11:50 -0400] - Operation error fetching 
 changenumber=2277387,cn=changelog (null), error -30993.
 [13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured 
 while adding change number 2277387, dn = 
 changenumber=2277387,cn=changelog: Operations error.
 [13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 [13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
 (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
 [13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
 changenumber=2281464,cn=changelog from entryrdn index (-30993)
 [13/Jul/2015:07:41:49 -0400] - Operation error fetching 
 changenumber=2281464,cn=changelog (null), error -30993.
 [13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured 
 while adding change number 2281464, dn = 
 changenumber=2281464,cn=changelog: Operations error.
 [13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 
 
 access logs seem to be showing normal activity. Here's the number of open
 connections:
 
 # ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc 
 -l
 62
 
 Note: the other two replicas have much higher open connections (250) and 
 low
 cpu load avgs.
 
 Here's some output of logconv.pl from our most recent access log on the 
 replica
 with high cpu load:
 
 Start of Logs:13/Jul/2015:04:49:18
 End of Logs:  13/Jul/2015:10:06:11
 
 Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds
 
 Restarts: 0
 Total Connections:2343
   - LDAP Connections:  2120
   - LDAPI Connections: 223
   - LDAPS Connections: 0
   - StartTLS Extended Ops: 45
   Secure Protocol Versions:
- TLS1.2 128-bit AES - 45
 
 Peak Concurrent Connections:  22
 Total Operations: 111865
 Total Results:111034
 Overall Performance:  99.3%
 
 Searches: 95585 (5.03/sec)  (301.64/min)
 Modifications:3369  (0.18/sec)  (10.63/min)
 Adds: 0 (0.00/sec)  (0.00/min)
 Deletes:  0 (0.00/sec)  (0.00/min)
 Mod RDNs: 0 (0.00/sec)  (0.00/min)
 Compares: 0 (0.00/sec)  (0.00/min)
 Binds:7082  (0.37/sec)  (22.35/min)
 
 Proxied Auth Operations:  0
 Persistent Searches:  0
 Internal Operations:  0
 Entry Operations: 0
 Extended Operations:  5317
 Abandoned Requests:   416
 Smart Referrals Received: 0
 
 VLV Operations:   96
 VLV Unindexed