Re: [Freeipa-users] ns-slapd high cpu usage
On Thu, Jul 16, 2015 at 10:32:59AM +0200, Ludwig Krispenz wrote:

Thank you for the data, I think I understand now what is going on. In the error logs we see only messages like (from my test env):

[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): replay_update: modifys operation (dn=dc=example,dc=com csn=55a82a2900010064) not sent - empty
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): replay_update: Consumer successfully sent operation with csn 55a82a2900010064
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): Skipping update operation with no message_id (uniqueid 7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a2900010064)

This happens if fractional replication is configured, as IPA does, and the changes affect only attributes which will NOT be replicated. So the local RUV will be updated, but since no change is really sent, the consumer RUV is not updated and replication will always start from a very old starting csn. It is a rare scenario where a server receives only mods which are not replicated. I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225

As a workaround, can you try to apply a mod on m14-26 which will not be stripped: either create a dummy user or add a description attribute to an existing object. Replication will once again iterate through all the changes (which can take a while), but then should replay this latest change and define a new offset.

Excellent. I can confirm your workaround fixed the issue. I updated a user's email address (on m14-26) and the load came back down to normal within a few minutes.

Thanks very much for all your help debugging this.

Best,

--Andrew

Regards,
Ludwig

--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
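Ludwig's workaround (any mod that fractional replication will not strip) can be expressed as an LDIF change for ldapmodify. This is a sketch: the target DN and the description value are placeholders, not taken from the thread.

```ldif
# Hypothetical entry; substitute a real user DN from your own suffix.
dn: uid=jdoe,cn=users,cn=accounts,dc=example,dc=com
changetype: modify
replace: description
description: replication bump 2015-07-16
```

Any attribute that is not excluded by the agreement's nsDS5ReplicatedAttributeList should work; per the thread, updating a user's mail attribute also did the trick.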
Re: [Freeipa-users] ns-slapd high cpu usage
Thank you for the data, I think I understand now what is going on. In the error logs we see only messages like (from my test env):

[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): replay_update: modifys operation (dn=dc=example,dc=com csn=55a82a2900010064) not sent - empty
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): replay_update: Consumer successfully sent operation with csn 55a82a2900010064
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300 (localhost:9759): Skipping update operation with no message_id (uniqueid 7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a2900010064)

This happens if fractional replication is configured, as IPA does, and the changes affect only attributes which will NOT be replicated. So the local RUV will be updated, but since no change is really sent, the consumer RUV is not updated and replication will always start from a very old starting csn. It is a rare scenario where a server receives only mods which are not replicated. I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225

As a workaround, can you try to apply a mod on m14-26 which will not be stripped: either create a dummy user or add a description attribute to an existing object. Replication will once again iterate through all the changes (which can take a while), but then should replay this latest change and define a new offset.

Regards,
Ludwig

On 07/15/2015 07:05 PM, Andrew E. Bruno wrote:
[...]
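A CSN like 5591a3d200070005 can be decoded by hand: the leading 8 hex digits are a Unix timestamp, the remaining fields identify the change. The following minimal Python sketch infers the field layout from the CSNs quoted in this thread (not from the 389-ds source) and reproduces Ludwig's reading of the stale maxcsn as 29 Jun 2015:

```python
from datetime import datetime, timezone

def parse_csn(csn):
    """Split a 389-ds CSN string into (change time, seq number, replica id).

    Layout assumed from the CSNs quoted in this thread:
      16+ hex chars: tttttttt ssss rrrr [uuuu]
      12  hex chars: tttttttt rrrr      (e.g. the replicageneration value)
    """
    ts = int(csn[0:8], 16)            # seconds since the Unix epoch
    if len(csn) >= 16:
        seq = int(csn[8:12], 16)      # sequence number within that second
        rid = int(csn[12:16], 16)     # originating replica id
    else:
        seq = 0
        rid = int(csn[8:12], 16)
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    return when, seq, rid

# The stale maxcsn for replica 5 (m14-26) from the thread:
when, seq, rid = parse_csn("5591a3d200070005")
print(when, seq, rid)  # 2015-06-29 20:00:18+00:00 7 5
```

The replica id in the last four digits (0005) is how Ludwig matched the CSN to replica 5 in the RUV output.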
Re: [Freeipa-users] ns-slapd high cpu usage
On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:
[...]

the next thing we can do to try to understand what is going on is to enable replication logging on m14-26, it will then not only consume all cpu, but write tons of messages to the error log. But it can be turned on and off:

ldapmodify ...
dn: cn=config
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

and let it run for a while, then set it back to: 0

I enabled replication logging and it's running now. I noticed the default value for nsslapd-errorlog-level was set to 16384 (not 0).

OK to send you the logs off list? Looks like they contain quite a bit of sensitive data.

Thanks again for all the help looking into this.

Best,

--Andrew

[...]
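Ludwig's toggle, written out as a complete LDIF fragment for ldapmodify. Note Andrew's observation that this instance's default was 16384, so that value (rather than 0) is what to restore afterwards:

```ldif
# Turn replication-level error logging on (8192 = replication debugging).
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
```

When done, apply the same modify with the previous value (16384 on this instance) to turn the verbose logging back off.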
Re: [Freeipa-users] ns-slapd high cpu usage
On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
[...]

so this is really strange, the master m14-24 has the latest change from replica 5 (m14-26) as 5591a3d200070005, which corresponds to Mon, 29 Jun 2015 20:00:18 GMT. So no update originating at m14-26 arrived since then, or it could not update the ruv. So m14-26 tries to replicate all the changes back from that time, but it looks like it has no success. is there anything in the logs of m14-24 ? can you see successful mods with csn=xxx0005 ?
Here's what I could find from the logs on srv-m14-24:

[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054

And here's the last few lines of the error log on srv-m14-24:

[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2498441,cn=changelog!!
[13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations error (1). Will retry later.
[13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations error (1). Will retry later.
[13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations error (1). Will retry later.
[13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations error (1). Will retry later.
[13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations error (1). Will retry later.
[13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a0060): Operations error (1). Will retry later.
[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad60060): Operations error (1). Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f60060): Operations error (1). Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa0060): Operations error (1). Will retry later.
[14/Jul/2015:09:56:52 -0400]
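A bare `grep -r 0005` also matches timestamps and unrelated ids, so a stricter filter keys on the trailing replica id of the csn= field. A hedged Python sketch (it assumes the 12- and 16-hex-digit CSN forms seen in this thread, where the replica id is the last four digits):

```python
import re

# Access-log RESULT lines carry a csn= field; the last four hex digits of the
# CSN name the originating replica. grep's near-miss in the thread
# (csn=55a4b5f00054 ends in 0054, not 0005) is filtered out correctly here.
CSN_RE = re.compile(r"csn=([0-9a-f]{12,20})")

def from_replica(lines, rid):
    """Yield log lines whose CSN originated at replica id `rid`."""
    suffix = "%04x" % rid
    for line in lines:
        m = CSN_RE.search(line)
        if m and m.group(1).endswith(suffix):
            yield line

sample = [
    "[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT err=0 "
    "tag=103 nentries=0 etime=1 csn=55a4b5f00054",
]
print(list(from_replica(sample, 5)))  # [] - no mods originating at replica 5
```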
Re: [Freeipa-users] ns-slapd high cpu usage
On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:

hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and Jul 7th - so it looks like it is iterating the changelog over and over again. The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master ? can you provide the result of the following search from m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w -b cn=config objectclass=nsds5replica nsds50ruv

master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W -b cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda0006 5594537800020006

so this is really strange, the master m14-24 has the latest change from replica 5 (m14-26) as 5591a3d200070005, which corresponds to Mon, 29 Jun 2015 20:00:18 GMT. So no update originating at m14-26 arrived since then, or it could not update the ruv. So m14-26 tries to replicate all the changes back from that time, but it looks like it has no success. is there anything in the logs of m14-24 ? can you see successful mods with csn=xxx0005 ?
# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e0056 55943e6f00010056
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba005b 5582c7e40004005b

server with high cpu load is srv-m14-26. here's the results of the ldapsearch from this server:

[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W -b cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7730005 55a55b4700030005
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7710004 55a53eba0004
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba005b 5582c7e40004005b
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e0056 55943e6f00010056

srv-m14-25-02 is our 3rd replica, which we recently added back in after it failed (it was added back in on 7/1). Let me know if you need anything else. Thanks for the help.

--Andrew

On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
[...]
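The nsds50ruv values above can be parsed mechanically when comparing replicas. A small Python sketch (the format assumptions come from the output quoted in this thread, not a formal spec):

```python
import re

# Parse one nsds50ruv value into (replica id, url, min csn, max csn).
# The {replicageneration} element has a different shape and is skipped.
RUV_RE = re.compile(r"\{replica (\d+) (ldap://[^}]+)\} ([0-9a-f]+) ([0-9a-f]+)")

def parse_ruv(value):
    m = RUV_RE.search(value)
    if not m:
        return None
    rid, url, mincsn, maxcsn = m.groups()
    return int(rid), url, mincsn, maxcsn

line = ("nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} "
        "5537c7730005 5591a3d200070005")
print(parse_ruv(line))
```

Running the same parser over both servers' output makes the mismatch for replica 5 (maxcsn 5591a3d2... on m14-24 versus 55a55b47... on m14-26) stand out immediately.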
Re: [Freeipa-users] ns-slapd high cpu usage
On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
[...]
Here's what I could find from the logs on srv-m14-24:

[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054

ok, so no update originating at replica 5 has been replicated (probably since June 29). did you experience data inconsistency between the servers ?

And here's the last few lines of the error log on srv-m14-24:

one set of messages refers to the o=ipaca backend and seems to be transient, replication continues later. the other set of messages, "No original tombstone ..", is annoying (and it is fixed in ticket https://fedorahosted.org/389/ticket/47912)

the next thing we can do to try to understand what is going on is to enable replication logging on m14-26, it will then not only consume all cpu, but write tons of messages to the error log. But it can be turned on and off:

ldapmodify ...
dn: cn=config
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

and let it run for a while, then set it back to: 0

[...]
[...]
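Ludwig's comparison can be made concrete: m14-24's maxcsn for replica 5 is 5591a3d2... (Jun 29), while m14-26's own RUV earlier in the thread shows 55a55b47... for the same replica. Since the leading 8 hex digits of a CSN are epoch seconds, the gap is simple arithmetic:

```python
# How far behind is the consumer's view of a replica? Subtract the epoch
# seconds encoded in the first 8 hex chars of the two maxcsn values.
def csn_age_gap(older_csn, newer_csn):
    return int(newer_csn[:8], 16) - int(older_csn[:8], 16)

gap = csn_age_gap("5591a3d200070005",   # replica 5's maxcsn as seen on m14-24
                  "55a55b4700030005")   # replica 5's maxcsn on m14-26 itself
print(gap // 86400, "days behind")
```

That puts m14-24 roughly two weeks behind on replica 5's changes, matching the "probably since June 29" reading above.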
Re: [Freeipa-users] ns-slapd high cpu usage
On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H shows this thread (2879) with high cpu usage:

2879 dirsrv 20 0 3819252 1.962g 11680 R 99.9 3.1 8822:10 ns-slapd

this thread is a replication thread sending updates, what is strange is that the current csn_str is quite old (July 7th), I can't tell which agreement this thread is handling, but looks like it is heavily reading the changelog and sending updates. anything changed recently in replication setup ?

Yes, we had one replica fail on 6/19, which we removed (not this one showing high CPU load). Had to perform some manual cleanup of the ipa-ca RUVs. Then we added the replica back in on 7/1. Since then, replication appears to have been running normally between the 3 replicas. We've been monitoring utilization since 7/1 and have only recently seen this spike (past 24 hours or so).

is it still in this state ? or was it a spike. if it still is high cpu consuming, could you
- get a few pstack like the one before with some time in between, I would like to see if it is progressing with the csns or looping on the same one
- check the consumer side. is there anything in the error log ? does the access log show replication activity from this server
- eventually enable replication logging: nsslapd-errorlog-level: 8192

On a side note, we get hit with this bug often: https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html (a rogue sssd_be process hammering a replica). This causes high ns-slapd utilization on the replica, and restarting sssd on the client host immediately fixes the issue. However, in this case, we're not seeing this behavior.

On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 freeipa-replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and 389-ds 1.3.3.1-16. Recently, the ns-slapd process on one of our replicas started showing higher than normal CPU usage. ns-slapd is pegged at high CPU usage more or less constantly. Seems very similar to this thread: https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not sure if these are related):

[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2277387, dn = changenumber=2277387,cn=changelog: Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2281464, dn = changenumber=2281464,cn=changelog: Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]

access logs seem to be showing normal activity. Here's the number of open connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low cpu load avgs.
Here's some output of logconv.pl from our most recent access log on the replica with high cpu load:

Start of Logs:        13/Jul/2015:04:49:18
End of Logs:          13/Jul/2015:10:06:11
Processed Log Time:   5 Hours, 16 Minutes, 53 Seconds
Restarts:             0
Total Connections:    2343
 - LDAP Connections:  2120
 - LDAPI Connections: 223
 - LDAPS Connections: 0
 - StartTLS Extended Ops: 45
Secure Protocol Versions:
 - TLS1.2 128-bit AES - 45
Peak Concurrent Connections: 22
Total Operations:     111865
Total Results:        111034
Overall Performance:  99.3%
Searches:             95585 (5.03/sec) (301.64/min)
Modifications:        3369 (0.18/sec) (10.63/min)
Adds:                 0 (0.00/sec) (0.00/min)
Deletes:              0 (0.00/sec) (0.00/min)
Mod RDNs:             0 (0.00/sec) (0.00/min)
Compares:             0
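As a sanity check on the logconv.pl summary, the reported per-second rates follow directly from the processed window of 5h 16m 53s:

```python
# Reproduce logconv.pl's per-second rates from its own totals.
window = 5 * 3600 + 16 * 60 + 53          # 19013 seconds of processed log time
searches, mods = 95585, 3369
print(round(searches / window, 2))        # 5.03
print(round(mods / window, 2))            # 0.18
```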
Re: [Freeipa-users] ns-slapd high cpu usage
hm, the stack traces show csn_str, which correspond to Jul 8th, Jul 4th, and Jul 7th - so it looks like it is iterating the changelog over and over again. The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master ? can you provide the result of the following search from m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w -b cn=config objectclass=nsds5replica nsds50ruv

On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
[...]

is it still in this state ? or was it a spike.

Yes same state.
if it still is high cpu consuming, could you - get a few pstack like the one before with some time in between, I would like to see if it is progressing with the csns or looping on the same one Attached are a few stacktraces. The thread pegging the cpu is: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 2879 dirsrv20 0 3819252 1.978g 11684 R 99.9 3.2 10148:26 ns-slapd - check the consumer side. is there anything in the error log ? does the access log show replication activity from this server Here's some errors showing up on the first master server rep1 (rep2 is the server with pegged cpu): [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad60060): Operations error (1). Will retry later. [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f60060): Operations error (1). Will retry later. [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-rep2-tomcat (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa0060): Operations error (1). Will retry later. Here's some snips from the access log of the rep2: [14/Jul/2015:08:22:31 -0400] conn=87 op=9794 EXT oid=2.16.840.1.113730.3.5.5 name=Netscape Replication End Session [14/Jul/2015:08:22:31 -0400] conn=87 op=9794 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:22:31 -0400] conn=87 op=9795 EXT oid=2.16.840.1.113730.3.5.12 name=replication-multimaster-extop [14/Jul/2015:08:22:31 -0400] conn=87 op=9795 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:22:33 -0400] conn=87 op=9796 EXT oid=2.16.840.1.113730.3.5.5 name=Netscape Replication End Session .. 
[14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 etime=0 csn=55a4ff6c0005 .. [14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 nentries=1 etime=0 [14/Jul/2015:08:24:03 -0400] conn=87 op=9810 EXT oid=2.16.840.1.113730.3.5.5 name=Netscape Replication End Session [14/Jul/2015:08:24:03 -0400] conn=87 op=9810 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:24:03 -0400] conn=87 op=9811 EXT oid=2.16.840.1.113730.3.5.12 name=replication-multimaster-extop [14/Jul/2015:08:24:03 -0400] conn=87 op=9811 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:24:05 -0400] conn=87 op=9812 EXT oid=2.16.840.1.113730.3.5.5 name=Netscape Replication End Session [14/Jul/2015:08:24:05 -0400] conn=87 op=9812 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:24:08 -0400] conn=87 op=9813 EXT oid=2.16.840.1.113730.3.5.12 name=replication-multimaster-extop [14/Jul/2015:08:24:08 -0400] conn=87 op=9813 RESULT err=0 tag=120 nentries=0 etime=0 [14/Jul/2015:08:24:10 -0400] conn=87 op=9814 EXT oid=2.16.840.1.113730.3.5.5 name=Netscape Replication End Session
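[Editor's note: the consumer-side errors above repeat the same uniqueid while the CSN advances, which is easier to see with a quick tally per agreement. The following is a sketch, not from the thread; the `replay_failures` helper name and the demo input are illustrative, and in practice you would feed it the dirsrv errors log.]

```shell
# Sketch: tally "Consumer failed to replay change" errors per replication
# agreement from a dirsrv errors log on stdin.
replay_failures() {
    grep 'Consumer failed to replay change' |
        sed 's/.*agmt=\(cn=[^ ]*\).*/\1/' |   # keep only the agreement name
        sort | uniq -c | sort -rn             # count per agreement, busiest first
}

# Demo on sample lines in the same format as the excerpts above;
# real use would be: replay_failures < /var/log/dirsrv/slapd-DOMAIN/errors
printf '%s\n' \
  '[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1, CSN 55a45ad60060): Operations error (1). Will retry later.' \
  '[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt=cn=masterAgreement1-rep2-pki-tomcat (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1, CSN 55a476f60060): Operations error (1). Will retry later.' |
  replay_failures
# prints a count of 2 for the pki-tomcat agreement
```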
Re: [Freeipa-users] ns-slapd high cpu usage
On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:

hm, the stack traces show csn_str values which correspond to Jul 8th, Jul 4th, and Jul 7th - so it looks like it is iterating the changelog over and over again. The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master? Can you provide the result of the following search from m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w -b cn=config objectclass=nsds5replica nsds50ruv

master is srv-m14-24. Here are the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e0056 55943e6f00010056
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba005b 5582c7e40004005b

server with high cpu load is srv-m14-26. Here are the results of the ldapsearch from this server:

[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7730005 55a55b4700030005
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7710004 55a53eba0004
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba005b 5582c7e40004005b
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e0056 55943e6f00010056

srv-m14-25-02 is our 3rd replica, which we recently added back in after it failed (added back on 7/1). Let me know if you need anything else. Thanks for the help.

--Andrew
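[Editor's note: one way to read these RUVs is that the leading 8 hex digits of a 389-ds CSN encode the change's Unix timestamp (the remaining digits are sequence number, replica id, and sub-sequence), so decoding the max CSNs shows how stale a replica's view is. A minimal sketch, assuming bash and GNU date; `csn_time` is an illustrative helper, not a dirsrv tool.]

```shell
# Sketch: print the timestamp embedded in a 389-ds CSN.
csn_time() {
    # take the leading 8 hex digits and render them as a UTC date
    date -u -d "@$((16#${1:0:8}))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Replica 4 (srv-m14-24) as seen in the two RUVs above:
csn_time 55a55aed0014   # supplier's own max CSN      → 2015-07-14 18:54:37 UTC
csn_time 55a53eba0004   # srv-m14-26's view of it     → 2015-07-14 16:54:18 UTC
```

A persistent gap between a supplier's max CSN and a consumer's recorded view of it is exactly the kind of lag being discussed here.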
Re: [Freeipa-users] ns-slapd high cpu usage
can you get a pstack of the slapd process along with a top -H to find the thread with high cpu usage

Ludwig

On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 freeipa-replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and 389-ds 1.3.3.1-16. Recently, the ns-slapd process on one of our replicas started showing higher than normal CPU usage. ns-slapd is pegged at high CPU usage more or less constantly.
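[Editor's note: the pstack + top -H pairing requested above can be joined mechanically. top -H reports each thread's CPU usage with its kernel thread id (LWP), and pstack labels every stack with the same LWP number. The sketch below is illustrative, not from the thread: the `hot_thread_stack` helper and the canned demo input are fabricated for the example.]

```shell
# Sketch: given pstack output on stdin, print only the stack of the thread
# whose LWP matches the hot thread id reported by `top -H`.
hot_thread_stack() {    # usage: pstack <pid> | hot_thread_stack <lwp>
    awk -v tid="$1" '
        /^Thread / { show = ($0 ~ "LWP " tid "\\)") }  # toggle at each thread header
        show                                           # print lines of the matching thread
    '
}

# Demo on canned pstack-style output;
# real use would be: pstack "$(pidof ns-slapd)" | hot_thread_stack 2879
printf '%s\n' \
  'Thread 2 (Thread 0x7f0a (LWP 2878)):' \
  '#0  0x0000 in poll ()' \
  'Thread 1 (Thread 0x7f0b (LWP 2879)):' \
  '#0  0x0000 in csn_as_string ()' |
  hot_thread_stack 2879
```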
[Freeipa-users] ns-slapd high cpu usage
We have 3 freeipa-replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and 389-ds 1.3.3.1-16. Recently, the ns-slapd process on one of our replicas started showing higher than normal CPU usage. ns-slapd is pegged at high CPU usage more or less constantly. Seems very similar to this thread:

https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not sure if these are related):

[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2277387, dn = changenumber=2277387,cn=changelog: Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2281464, dn = changenumber=2281464,cn=changelog: Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]

access logs seem to be showing normal activity.

Here's the number of open connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd | grep socket | wc -l
62

Note: the other two replicas have much higher open connections (250) and low cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica with high cpu load:

Start of Logs:                13/Jul/2015:04:49:18
End of Logs:                  13/Jul/2015:10:06:11
Processed Log Time:           5 Hours, 16 Minutes, 53 Seconds
Restarts:                     0
Total Connections:            2343
 - LDAP Connections:          2120
 - LDAPI Connections:         223
 - LDAPS Connections:         0
 - StartTLS Extended Ops:     45
Secure Protocol Versions:
 - TLS1.2 128-bit AES - 45
Peak Concurrent Connections:  22
Total Operations:             111865
Total Results:                111034
Overall Performance:          99.3%
Searches:                     95585  (5.03/sec)  (301.64/min)
Modifications:                3369   (0.18/sec)  (10.63/min)
Adds:                         0      (0.00/sec)  (0.00/min)
Deletes:                      0      (0.00/sec)  (0.00/min)
Mod RDNs:                     0      (0.00/sec)  (0.00/min)
Compares:                     0      (0.00/sec)  (0.00/min)
Binds:                        7082   (0.37/sec)  (22.35/min)
Proxied Auth Operations:      0
Persistent Searches:          0
Internal Operations:          0
Entry Operations:             0
Extended Operations:          5317
Abandoned Requests:           416
Smart Referrals Received:     0
VLV Operations:               96
VLV Unindexed Searches:       0
VLV Unindexed Components:     32
SORT Operations:              64
Entire Search Base Queries:   0
Paged Searches:               3882
Unindexed Searches:           0
Unindexed Components:         5
FDs Taken:                    2566
FDs Returned:                 2643
Highest FD Taken:             249
Broken Pipes:                 0
Connections Reset By Peer:    0
Resource Unavailable:         0
Max BER Size Exceeded:        0
Binds:                        7082
Unbinds:                      2443
 - LDAP v2 Binds:             0
 - LDAP v3 Binds:             6859
 - AUTOBINDs:                 223
 - SSL Client Binds:          0
 - Failed SSL Client Binds:   0
 - SASL Binds:                6814
    GSSAPI - 6591
    EXTERNAL - 223
 - Directory Manager Binds:   0
 - Anonymous Binds:           6591
 - Other Binds:               491

strace timing on the ns-slapd process:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.40    0.346659          59      7758           poll
  4.10    0.015057       15057         1           restart_syscall
  0.91    0.003353          57        59           getpeername
  0.49    0.001796         150        12           futex
  0.10    0.000364          73         5           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.367229                 13559           total

top output (with threads 'H'):

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
Re: [Freeipa-users] ns-slapd high cpu usage
On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H shows this thread (2879) with high cpu usage:

 2879 dirsrv   20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd

this thread is a replication thread sending updates. What is strange is that the current csn_str is quite old (July 7th). I can't tell which agreement this thread is handling, but it looks like it is heavily reading the changelog and sending updates. Anything changed recently in the replication setup?

On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 freeipa-replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and 389-ds 1.3.3.1-16. Recently, the ns-slapd process on one of our replicas started showing higher than normal CPU usage. ns-slapd is pegged at high CPU usage more or less constantly.
Re: [Freeipa-users] ns-slapd high cpu usage
On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:

can you get a pstack of the slapd process along with a top -H to find the thread with high cpu usage

Attached is the full stacktrace of the running ns-slapd process. top -H shows this thread (2879) with high cpu usage:

 2879 dirsrv   20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd

this thread is a replication thread sending updates. What is strange is that the current csn_str is quite old (July 7th). I can't tell which agreement this thread is handling, but it looks like it is heavily reading the changelog and sending updates. Anything changed recently in the replication setup?

Yes, we had one replica fail on 6/19 which we removed (not this one showing high CPU load). Had to perform some manual cleanup of the ipa-ca RUVs. Then we added the replica back in on 7/1. Since then, replication appears to have been running normally between the 3 replicas. We've been monitoring utilization since 7/1 and only recently seen this spike (past 24 hours or so).

On a side note, we get hit with this bug often: https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html (a rogue sssd_be process hammering a replica). This causes high ns-slapd utilization on the replica, and restarting sssd on the client host immediately fixes the issue. However, in this case we're not seeing this behavior.

On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:

We have 3 freeipa-replicas: CentOS 7.1.1503, ipa-server 4.1.0-18, and 389-ds 1.3.3.1-16. Recently, the ns-slapd process on one of our replicas started showing higher than normal CPU usage. ns-slapd is pegged at high CPU usage more or less constantly.