[Freeipa-users] upgrade ipa-server fails changing dogtag key

2017-03-20 Thread Andrew E. Bruno
When yum-updating our ipa-server running CentOS 7.3.1611 from
ipa-server-4.4.0-14.el7.centos.1.1.x86_64 to
ipa-server-4.4.0-14.el7.centos.6.x86_64, we got this error:


IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command 
ipa-server-upgrade manually.
Unexpected error - see /var/log/ipaupgrade.log for details:
OSError: [Errno 2] No such file or directory: 
'/etc/pki/pki-tomcat/dogtag.keytab'
The ipa-server-upgrade command failed. See /var/log/ipaupgrade.log for more 
information

Inspecting /var/log/ipaupgrade.log shows this error:


2017-03-20T12:58:41Z DEBUG Process finished, return code=0
2017-03-20T12:58:41Z DEBUG stdout=Authenticating as principal root/admin@REALM 
with password.

2017-03-20T12:58:41Z DEBUG stderr=kadmin.local: Server error while changing 
dogtag/host@REAM's key

2017-03-20T12:58:41Z ERROR IPA server upgrade failed: Inspect 
/var/log/ipaupgrade.log and run command ipa-server-upgrade manually.
2017-03-20T12:58:41Z DEBUG   File 
"/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 171, in execute
return_value = self.run()
  File 
"/usr/lib/python2.7/site-packages/ipaserver/install/ipa_server_upgrade.py", 
line 46, in run
server.upgrade()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/upgrade.py", 
line 1863, in upgrade
upgrade_configuration()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/server/upgrade.py", 
line 1796, in upgrade_configuration
ca.setup_lightweight_ca_key_retrieval()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/cainstance.py", line 
1400, in setup_lightweight_ca_key_retrieval
self.__setup_lightweight_ca_key_retrieval_kerberos()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/cainstance.py", line 
1431, in __setup_lightweight_ca_key_retrieval_kerberos
os.chmod(keytab, 0o600)

2017-03-20T12:58:41Z DEBUG The ipa-server-upgrade command failed, exception: 
OSError: [Errno 2] No such file or directory: 
'/etc/pki/pki-tomcat/dogtag.keytab'


The ipa services came back up (kinit works and we can log in to the console).
This seems related to [1,2]. We checked that the dogtag service points to
the default service password policy per [1]:

$ ipa service-show --all dogtag/host

  krbpwdpolicyreference: cn=Default Service Password 
Policy,cn=services,cn=accounts,dc=REALM

However, when listing all of the password policies, this entry doesn't seem to
exist anywhere. We only have a single global pwpolicy:

$ ipa pwpolicy-find
  Group: global_policy

Number of entries returned 1
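
ipa pwpolicy-find may only list the group-based policies, so one way to check
whether the referenced entry actually exists is a direct base search against
that DN (copied from the service entry above, realm components elided as
before):

$ ldapsearch -Y GSSAPI -LLL -s base \
    -b "cn=Default Service Password Policy,cn=services,cn=accounts,dc=REALM"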


Could this be related to the error? Any pointers on how to troubleshoot?
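
One possible recovery sketch (untested here; the principal comes from the
kadmin error above, and the keytab ownership and the use of the local server
are assumptions) would be to recreate the keytab by hand and re-run the
upgrade:

# ipa-getkeytab -s $(hostname -f) -p dogtag/$(hostname -f)@REALM \
    -k /etc/pki/pki-tomcat/dogtag.keytab
# chown pkiuser:pkiuser /etc/pki/pki-tomcat/dogtag.keytab
# chmod 600 /etc/pki/pki-tomcat/dogtag.keytab
# ipa-server-upgrade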

Thanks in advance.

--Andrew


[1] https://www.redhat.com/archives/freeipa-users/2017-March/msg00178.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1404910

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


[Freeipa-users] changelog entry cache size

2017-01-17 Thread Andrew E. Bruno
We just upgraded to CentOS 7.3 (FreeIPA 4.3 -> 4.4) and are seeing this in the
error logs:


[17/Jan/2017:08:41:38.057173466 -0500] WARNING: changelog: entry cache size 
512000 B is less than db size 696377344 B; We recommend to increase
the entry cache size nsslapd-cachememsize.  [17/Jan/2017:08:41:38.058266046 
-0500] Setting ncache to: 2 to keep each chunk below 4Gbytes


Is it worth increasing the changelog entry cache size? 

We're seeing high CPU utilization spikes for the ns-slapd process.
Roughly every two minutes ns-slapd spikes up to 65% utilization. Could this
be the cause?
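
If increasing it is worthwhile, a minimal sketch (assuming the warning refers
to the retro changelog ldbm backend; the value is just the db size from the
warning rounded up, and a dirsrv restart may be needed):

$ ldapsearch -D "cn=Directory Manager" -W -LLL \
    -b "cn=changelog,cn=ldbm database,cn=plugins,cn=config" nsslapd-cachememsize
$ ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=changelog,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 734003200
EOF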


Thanks!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] replica DS failure deadlock

2016-10-19 Thread Andrew E. Bruno
On Wed, Oct 19, 2016 at 06:59:57PM +0200, thierry bordaz wrote:
> 
> 
> On 10/19/2016 06:28 PM, Andrew E. Bruno wrote:
> > On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote:
> > > On 10/19/2016 05:02 PM, Ludwig Krispenz wrote:
> > > > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > > > We had one of our replicas fail today with the following errors:
> > > > > > > 
> > > > > > > 
> > > > > > > [18/Oct/2016:13:40:47 -0400]
> > > > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu"
> > > > > > > (srv-m14-32:389) - Can't locate CSN 58065ef300010003 in
> > > > > > > the changelog (DB rc=-30988). If replication stops, the
> > > > > > > consumer may need to be reinitialized.
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > > > transaction (csn=58065f7400050004) failed (rc=-30993
> > > > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock))
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > > > entry with csn (58065f7400050004); db error - -30993
> > > > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > write_changelog_and_ruv: can't add a change for
> > > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > > > changelog csn 58065f7400050004
> > > > > > > [18/Oct/2016:13:43:07 -0400] -
> > > > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but
> > > > > > > did not set SLAPI_RESULT_CODE
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > process_postop: Failed to apply update
> > > > > > > (58065f7400050004) error (1).  Aborting replication
> > > > > > > session(conn=1314106 op=1688559)
> > > > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify:
> > > > > > > modified entry is NULL--updating cache just in case
> > > > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition
> > > > > > > cn=Password
> > > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > > > Templates found, which should be added before the CoS
> > > > > > > Definition.
> > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null
> > > > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get
> > > > > > > id for changenumber=30856302,cn=changelog from entryrdn
> > > > > > > index (-30993)
> > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching
> > > > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an
> > > > > > > error occured while adding change number 30856302, dn =
> > > > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin -
> > > > > > > retrocl_postob: operation failure [1]
> > > > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin -
> > > > > > > process_postop: Failed to apply update
> > > > > > > (58065f9f0060) error (1).  Aborting replication
> > > > > > > session(conn=1901274 op=5)
> > > > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry
> > > > > > > BAD 1601, err=0 BDB0062 Successful return: 0
> > > > > > > [18/Oc

Re: [Freeipa-users] replica DS failure deadlock

2016-10-19 Thread Andrew E. Bruno
On Wed, Oct 19, 2016 at 07:05:14PM +0200, thierry bordaz wrote:
> 
> 
> On 10/19/2016 06:54 PM, Andrew E. Bruno wrote:
> > On Wed, Oct 19, 2016 at 06:33:05PM +0200, thierry bordaz wrote:
> > > 
> > > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > > We had one of our replicas fail today with the following errors:
> > > > > > 
> > > > > > 
> > > > > > [18/Oct/2016:13:40:47 -0400] 
> > > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" (srv-m14-32:389) - 
> > > > > > Can't locate CSN 58065ef300010003 in the changelog (DB 
> > > > > > rc=-30988). If replication stops, the consumer may need to be 
> > > > > > reinitialized.
> > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog 
> > > > > > program - _cl5WriteOperationTxn: retry (49) the transaction 
> > > > > > (csn=58065f7400050004) failed (rc=-30993 (BDB0068 
> > > > > > DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
> > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog 
> > > > > > program - _cl5WriteOperationTxn: failed to write entry with csn 
> > > > > > (58065f7400050004); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: 
> > > > > > Locker killed to resolve a deadlock
> > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - 
> > > > > > write_changelog_and_ruv: can't add a change for 
> > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu 
> > > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to 
> > > > > > changelog csn 58065f7400050004
> > > > > > [18/Oct/2016:13:43:07 -0400] - SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN 
> > > > > > plugin returned error but did not set SLAPI_RESULT_CODE
> > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - 
> > > > > > process_postop: Failed to apply update (58065f7400050004) error 
> > > > > > (1).  Aborting replication session(conn=1314106 op=1688559)
> > > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: modified 
> > > > > > entry is NULL--updating cache just in case
> > > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition cn=Password 
> > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS 
> > > > > > Templates found, which should be added before the CoS Definition.
> > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null DN 
> > > > > > (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get id for 
> > > > > > changenumber=30856302,cn=changelog from entryrdn index (-30993)
> > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching 
> > > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an error 
> > > > > > occured while adding change number 30856302, dn = 
> > > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin - retrocl_postob: 
> > > > > > operation failure [1]
> > > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - 
> > > > > > process_postop: Failed to apply update (58065f9f0060) error 
> > > > > > (1).  Aborting replication session(conn=1901274 op=5)
> > > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry BAD 
> > > > > > 1601, err=0 BDB0062 Successful return: 0
> > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog 
> > > > > > program - _cl5WriteOperationTxn: retry (49) the transaction 
> > > > > > (csn=58065f7c000a0004) failed (rc=-30993 (BDB0068 
> > > > > > DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
> > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog 
> > > > > > program - _cl5WriteOperationTxn: failed to write entry with csn 
> > > > > > (58065f7c000a00

Re: [Freeipa-users] replica DS failure deadlock

2016-10-19 Thread Andrew E. Bruno
On Wed, Oct 19, 2016 at 06:33:05PM +0200, thierry bordaz wrote:
> 
> 
> On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > We had one of our replicas fail today with the following errors:
> > > > 
> > > > 
> > > > [18/Oct/2016:13:40:47 -0400] 
> > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" (srv-m14-32:389) - Can't 
> > > > locate CSN 58065ef300010003 in the changelog (DB rc=-30988). If 
> > > > replication stops, the consumer may need to be reinitialized.
> > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program 
> > > > - _cl5WriteOperationTxn: retry (49) the transaction 
> > > > (csn=58065f7400050004) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: 
> > > > Locker killed to resolve a deadlock))
> > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program 
> > > > - _cl5WriteOperationTxn: failed to write entry with csn 
> > > > (58065f7400050004); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: 
> > > > Locker killed to resolve a deadlock
> > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - 
> > > > write_changelog_and_ruv: can't add a change for 
> > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu 
> > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to changelog 
> > > > csn 58065f7400050004
> > > > [18/Oct/2016:13:43:07 -0400] - SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN 
> > > > plugin returned error but did not set SLAPI_RESULT_CODE
> > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - process_postop: 
> > > > Failed to apply update (58065f7400050004) error (1).  Aborting 
> > > > replication session(conn=1314106 op=1688559)
> > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: modified entry 
> > > > is NULL--updating cache just in case
> > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition cn=Password 
> > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates 
> > > > found, which should be added before the CoS Definition.
> > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null DN 
> > > > (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get id for 
> > > > changenumber=30856302,cn=changelog from entryrdn index (-30993)
> > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching 
> > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an error occured 
> > > > while adding change number 30856302, dn = 
> > > > changenumber=30856302,cn=changelog: Operations error.
> > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin - retrocl_postob: operation 
> > > > failure [1]
> > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - process_postop: 
> > > > Failed to apply update (58065f9f0060) error (1).  Aborting 
> > > > replication session(conn=1901274 op=5)
> > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry BAD 1601, 
> > > > err=0 BDB0062 Successful return: 0
> > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program 
> > > > - _cl5WriteOperationTxn: retry (49) the transaction 
> > > > (csn=58065f7c000a0004) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: 
> > > > Locker killed to resolve a deadlock))
> > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program 
> > > > - _cl5WriteOperationTxn: failed to write entry with csn 
> > > > (58065f7c000a0004); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: 
> > > > Locker killed to resolve a deadlock
> > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - 
> > > > write_changelog_and_ruv: can't add a change for 
> > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu 
> > > > (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to changelog 
> > > > csn 58065f7c000a0004
> > > > 
> > > > 
> > > > ns-slapd was hung so we restarted and now it's stuck and won't come 
> > > > back up. It
> > > > hangs up here:
> > > > 
> > > > [18/Oct/20

Re: [Freeipa-users] replica DS failure deadlock

2016-10-19 Thread Andrew E. Bruno
On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote:
> 
> On 10/19/2016 05:02 PM, Ludwig Krispenz wrote:
> > 
> > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > We had one of our replicas fail today with the following errors:
> > > > > 
> > > > > 
> > > > > [18/Oct/2016:13:40:47 -0400]
> > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu"
> > > > > (srv-m14-32:389) - Can't locate CSN 58065ef300010003 in
> > > > > the changelog (DB rc=-30988). If replication stops, the
> > > > > consumer may need to be reinitialized.
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f7400050004) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f7400050004); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > write_changelog_and_ruv: can't add a change for
> > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > changelog csn 58065f7400050004
> > > > > [18/Oct/2016:13:43:07 -0400] -
> > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but
> > > > > did not set SLAPI_RESULT_CODE
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f7400050004) error (1).  Aborting replication
> > > > > session(conn=1314106 op=1688559)
> > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify:
> > > > > modified entry is NULL--updating cache just in case
> > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition
> > > > > cn=Password
> > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > Templates found, which should be added before the CoS
> > > > > Definition.
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null
> > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get
> > > > > id for changenumber=30856302,cn=changelog from entryrdn
> > > > > index (-30993)
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching
> > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an
> > > > > error occured while adding change number 30856302, dn =
> > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin -
> > > > > retrocl_postob: operation failure [1]
> > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f9f0060) error (1).  Aborting replication
> > > > > session(conn=1901274 op=5)
> > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry
> > > > > BAD 1601, err=0 BDB0062 Successful return: 0
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f7c000a0004) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f7c000a0004); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:

Re: [Freeipa-users] replica DS failure deadlock

2016-10-19 Thread Andrew E. Bruno
On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> 
> On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > We had one of our replicas fail today with the following errors:
> > 
> > 
> > [18/Oct/2016:13:40:47 -0400] agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" 
> > (srv-m14-32:389) - Can't locate CSN 58065ef300010003 in the changelog 
> > (DB rc=-30988). If replication stops, the consumer may need to be 
> > reinitialized.
> > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5WriteOperationTxn: retry (49) the transaction 
> > (csn=58065f7400050004) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: 
> > Locker killed to resolve a deadlock))
> > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5WriteOperationTxn: failed to write entry with csn 
> > (58065f7400050004); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker 
> > killed to resolve a deadlock
> > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - 
> > write_changelog_and_ruv: can't add a change for 
> > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 
> > 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 
> > 58065f7400050004
> > [18/Oct/2016:13:43:07 -0400] - SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin 
> > returned error but did not set SLAPI_RESULT_CODE
> > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - process_postop: Failed 
> > to apply update (58065f7400050004) error (1).  Aborting replication 
> > session(conn=1314106 op=1688559)
> > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: modified entry is 
> > NULL--updating cache just in case
> > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition cn=Password 
> > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates 
> > found, which should be added before the CoS Definition.
> > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null DN 
> > (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get id for 
> > changenumber=30856302,cn=changelog from entryrdn index (-30993)
> > [18/Oct/2016:13:43:20 -0400] - Operation error fetching 
> > changenumber=30856302,cn=changelog (null), error -30993.
> > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an error occured 
> > while adding change number 30856302, dn = 
> > changenumber=30856302,cn=changelog: Operations error.
> > [18/Oct/2016:13:43:20 -0400] retrocl-plugin - retrocl_postob: operation 
> > failure [1]
> > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - process_postop: Failed 
> > to apply update (58065f9f0060) error (1).  Aborting replication 
> > session(conn=1901274 op=5)
> > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 
> > BDB0062 Successful return: 0
> > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5WriteOperationTxn: retry (49) the transaction 
> > (csn=58065f7c000a0004) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: 
> > Locker killed to resolve a deadlock))
> > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5WriteOperationTxn: failed to write entry with csn 
> > (58065f7c000a0004); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker 
> > killed to resolve a deadlock
> > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - 
> > write_changelog_and_ruv: can't add a change for 
> > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 
> > 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 
> > 58065f7c000a0004
> > 
> > 
> > ns-slapd was hung so we restarted and now it's stuck and won't come back 
> > up. It
> > hangs up here:
> > 
> > [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition cn=Password 
> > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates 
> > found, which should be added before the CoS Definition.
> > [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5NewDBFile: PR_DeleteSemaphore: 
> > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema;
> >  NSPR error - -5943
> > [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin - changelog program - 
> > _cl5NewDBFile: PR_DeleteSemaphore: 
> > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema;
> >  NSPR error - -5943
> > 
> > 
> > Tried deleting the semaphore files and restarting 

[Freeipa-users] replica DS failure deadlock

2016-10-18 Thread Andrew E. Bruno
We had one of our replicas fail today with the following errors:


[18/Oct/2016:13:40:47 -0400] agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" 
(srv-m14-32:389) - Can't locate CSN 58065ef300010003 in the changelog (DB 
rc=-30988). If replication stops, the consumer may need to be reinitialized.
[18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: retry (49) the transaction (csn=58065f7400050004) 
failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a 
deadlock))
[18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: failed to write entry with csn (58065f7400050004); 
db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - write_changelog_and_ruv: 
can't add a change for 
uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 
939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 
58065f7400050004
[18/Oct/2016:13:43:07 -0400] - SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin 
returned error but did not set SLAPI_RESULT_CODE
[18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - process_postop: Failed to 
apply update (58065f7400050004) error (1).  Aborting replication 
session(conn=1314106 op=1688559)
[18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: modified entry is 
NULL--updating cache just in case
[18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition cn=Password 
Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, 
which should be added before the CoS Definition.
[18/Oct/2016:13:43:20 -0400] - Operation error fetching Null DN 
(4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
[18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get id for 
changenumber=30856302,cn=changelog from entryrdn index (-30993)
[18/Oct/2016:13:43:20 -0400] - Operation error fetching 
changenumber=30856302,cn=changelog (null), error -30993.
[18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 30856302, dn = changenumber=30856302,cn=changelog: 
Operations error.
[18/Oct/2016:13:43:20 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - process_postop: Failed to 
apply update (58065f9f0060) error (1).  Aborting replication 
session(conn=1901274 op=5)
[18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 
BDB0062 Successful return: 0
[18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: retry (49) the transaction (csn=58065f7c000a0004) 
failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a 
deadlock))
[18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: failed to write entry with csn (58065f7c000a0004); 
db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - write_changelog_and_ruv: 
can't add a change for 
uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu (uniqid: 
4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to changelog csn 
58065f7c000a0004


ns-slapd was hung, so we restarted it, and now it's stuck and won't come back
up. It hangs here:

[18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition cn=Password 
Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS Templates found, 
which should be added before the CoS Definition.
[18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema;
 NSPR error - -5943
[18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema;
 NSPR error - -5943


We tried deleting the semaphore files and restarting, but no luck. Attached
is a stack trace of the stuck ns-slapd process.
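
The trace was taken with gdb against the live process; roughly, with the
matching debuginfo packages installed, something like:

# gdb -p $(pidof ns-slapd) -batch -ex 'thread apply all bt full' \
    > /tmp/ns-slapd.stacktrace.txt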

Here are the versions we're running:

ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64
389-ds-base-1.3.4.0-33.el7_2.x86_64

FWIW, we were experimenting with the new life-cycle management features,
specifically "preserved" users, and deleted the user "janedoe" when this
happened. From the errors above, it looks like this host failed to
replicate that change? Not sure if this is related or not.

Is it possible to recover the database? Thanks in advance for any pointers.


--Andrew

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as 

Re: [Freeipa-users] ipa replica failed PR_DeleteSemaphore

2016-03-14 Thread Andrew E. Bruno
On Mon, Mar 14, 2016 at 09:35:15AM +0100, Ludwig Krispenz wrote:
> 
> On 03/12/2016 04:02 PM, Andrew E. Bruno wrote:
> >On Wed, Mar 09, 2016 at 06:08:04PM +0100, Ludwig Krispenz wrote:
> >>On 03/09/2016 05:51 PM, Andrew E. Bruno wrote:
> >>>On Wed, Mar 09, 2016 at 05:21:50PM +0100, Ludwig Krispenz wrote:
> >>>
> >>>[09/Mar/2016:11:33:03 -0500] NSMMReplicationPlugin - changelog program - 
> >>>_cl5NewDBFile: PR_DeleteSemaphore: 
> >>>/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/ed35d212-2cb811e5-af63d574-de3f6355.sema;
> >>> NSPR error - -5943
> >>if ds is cleanly shutdown this file should be removed, if ds is killed it
> >>remains and should be recreated at restart, which fails. could you try
> >>another stop, remove the file manually and start again ?
> >>>
> >We had our replicas crash again. Curious if it's safe to delete the
> >other db files as well:
> >
> >ls -alh /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/
> >   30  DBVERSION
> >6.8G  ed35d212-2cb811e5-af63d574-de3f6355_55a955910004.db
> >0  ed35d212-2cb811e5-af63d574-de3f6355.sema
> >  18M  f32bb356-2cb811e5-af63d574-de3f6355_55a955ca0060.db
> >0  f32bb356-2cb811e5-af63d574-de3f6355.sema
> >
> >
> >Should all these files be deleted if the ds is cleanly shutdown? or should we
> >only remove the *.sema files.
> the *.db file contains the data of the changelog, if you delete them you
> start with a new cl and could get into replication problems requiring
> reinitialization. you normally shoul not delete them.
> The .sema is used to control how many threads can concurrently access the
> cl, it should be recreated at restart, so it is safe to delete them after a
> crash.

Sounds good, thanks. We deleted the .sema files after the crash and the
replicas came back up OK.
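
For the record, the sequence was roughly as follows (instance path taken from
the logs; the changelog *.db files were left alone per the above):

# ipactl stop
# rm -f /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/*.sema
# ipactl start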

> 
> If you getting frequent crashes, we shoul try to find the reason for the
> crashes, could you try to get a core file ?

This time we had two replicas crash, and ns-slapd wasn't running, so we
couldn't grab a pstack. Here's a snip from the error logs right before
the crash (not sure if this is related or not):

[11/Mar/2016:09:57:56 -0500] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=11573832,cn=changelog!!
[11/Mar/2016:09:57:57 -0500] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=11575824,cn=changelog!!
[11/Mar/2016:09:57:58 -0500] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=11575851,cn=changelog!!
[11/Mar/2016:10:00:28 -0500] - libdb: BDB2055 Lock table is out of available 
lock entries
[11/Mar/2016:10:00:28 -0500] NSMMReplicationPlugin - changelog program - 
_cl5CompactDBs: failed to compact 986efe12-71b811e5-9d33a516-e778e883; db error 
- 12 Cannot allocate memory
[11/Mar/2016:10:02:07 -0500] - libdb: BDB2055 Lock table is out of available 
lock entries
[11/Mar/2016:10:02:07 -0500] - compactdb: failed to compact changelog; db error 
- 12 Cannot allocate memory
[11/Mar/2016:12:36:18 -0500] - slapd_poll(377) timed out
[11/Mar/2016:13:06:17 -0500] - slapd_poll(377) timed out
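
The BDB2055 messages look like the BDB lock table being exhausted; a hedged
sketch for raising it (nsslapd-db-locks on the ldbm config entry; the value is
only an example, and a dirsrv restart is needed for it to take effect):

$ ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-locks
nsslapd-db-locks: 50000
EOF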

We just upgraded to IPA 4.2 on CentOS 7.2, and if we see any more crashes
we'll try to get more info.

Thanks again.

--Andrew


-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failed PR_DeleteSemaphore

2016-03-12 Thread Andrew E. Bruno
On Wed, Mar 09, 2016 at 06:08:04PM +0100, Ludwig Krispenz wrote:
> 
> On 03/09/2016 05:51 PM, Andrew E. Bruno wrote:
> >On Wed, Mar 09, 2016 at 05:21:50PM +0100, Ludwig Krispenz wrote:
> >
> >[09/Mar/2016:11:33:03 -0500] NSMMReplicationPlugin - changelog program - 
> >_cl5NewDBFile: PR_DeleteSemaphore: 
> >/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/ed35d212-2cb811e5-af63d574-de3f6355.sema;
> > NSPR error - -5943
> if ds is cleanly shutdown this file should be removed, if ds is killed it
> remains and should be recreated at restart, which fails. could you try
> another stop, remove the file manually and start again ?
> >
> >

We had our replicas crash again. Curious if it's safe to delete the
other db files as well:

ls -alh /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/
  30  DBVERSION
6.8G  ed35d212-2cb811e5-af63d574-de3f6355_55a955910004.db
   0  ed35d212-2cb811e5-af63d574-de3f6355.sema
 18M  f32bb356-2cb811e5-af63d574-de3f6355_55a955ca0060.db
   0  f32bb356-2cb811e5-af63d574-de3f6355.sema


Should all these files be deleted if the DS is cleanly shut down, or should we
only remove the *.sema files?

Thanks,

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failed PR_DeleteSemaphore

2016-03-09 Thread Andrew E. Bruno
On Wed, Mar 09, 2016 at 05:21:50PM +0100, Ludwig Krispenz wrote:
> 
> On 03/09/2016 04:46 PM, Andrew E. Bruno wrote:
> >On Wed, Mar 09, 2016 at 10:37:05AM -0500, Andrew E. Bruno wrote:
> >>On Wed, Mar 09, 2016 at 04:13:28PM +0100, Ludwig Krispenz wrote:
> >>>if the process hangs, could you get a pstack from the process ?
> >>
> >>I'd be happy to provide a pstack but can't seem to get the correct debuginfo
> >>packages installed.. we're running centos7 and  389-ds-base 1.3.3.1. We 
> >>haven't
> >>upgraded to 1.3.4.0. How can I get the debuginfo packages installed for that
> >>specific version.
> >Nevermind.. i got the debuginfo packages. Attached is the stacktrace of
> >our second failed replicate that's currently hung.
> not sure, but the process could be in a deadlock, there are threads in the
> retro cl and memberof plugin and we have seen deadlocks there. In that case
> restart or stop would not be able to stop the ds process. You can try ipactl
> stop and if the ds process is still running, you have to kill it

Our first master came back up after the restart and appears to be
working again. On the second replica, which was hung, we did an ipactl stop
and it killed the ds process. We're running an ipactl start now. We got the
same error:


[09/Mar/2016:11:33:03 -0500] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/ed35d212-2cb811e5-af63d574-de3f6355.sema;
 NSPR error - -5943


We're going to let this run and hopefully it will come back up.

Just want to confirm again that these can be safely ignored:

[09/Mar/2016:10:23:10 -0500] DSRetroclPlugin - delete_changerecord: could not 
delete change record 11272989 (rc: 32)
[09/Mar/2016:10:23:10 -0500] DSRetroclPlugin - delete_changerecord: could not 
delete change record 11272990 (rc: 32)

They fill up the logs when bringing the ds back up.

We seem to keep getting bitten by this deadlock [1,2]. Replicas become
unresponsive and file descriptor counts increase. Other than a pstack, if
there's any other info we can provide or check, let us know.

We'll be upgrading to CentOS 7.2 and 389-ds-base 1.3.4 next week.

As always, thanks again for the help and quick responses. 

Best,

--Andrew

[1] https://www.redhat.com/archives/freeipa-users/2015-September/msg6.html
[2] https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failed PR_DeleteSemaphore

2016-03-09 Thread Andrew E. Bruno
On Wed, Mar 09, 2016 at 04:13:28PM +0100, Ludwig Krispenz wrote:
> 
> On 03/09/2016 03:46 PM, Andrew E. Bruno wrote:
> >Hello,
> >
> >We had a replica fail today with:
> >
> >[09/Mar/2016:09:39:59 -0500] NSMMReplicationPlugin - changelog program - 
> >_cl5NewDBFile: PR_DeleteSemaphore: 
> >/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema;
> > NSPR error - -5943
> the nspr error means:
> /* Cannot create or rename a filename that already exists */
> #define PR_FILE_EXISTS_ERROR (-5943L)
> 
> could you check if the file exists and if there is a permission problem for
> the dirsrv user to recreate it ?

Looks like the file exists:


# ls -alh 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema
-rw-r--r-- 1 dirsrv dirsrv 0 Mar  9 09:39 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema

> if the process hangs, could you get a pstack from the process ?

We did a systemctl restart ipa, which failed, but it looks like dirsrv is
still running. The logs are now filling up with:

[09/Mar/2016:10:23:10 -0500] DSRetroclPlugin - delete_changerecord: could not 
delete change record 11272988 (rc: 32)
[09/Mar/2016:10:23:10 -0500] DSRetroclPlugin - delete_changerecord: could not 
delete change record 11272989 (rc: 32)
[09/Mar/2016:10:23:10 -0500] DSRetroclPlugin - delete_changerecord: could not 
delete change record 11272990 (rc: 32)

However, if I do a kinit, I get:

kinit: Cannot contact any KDC for realm 'CBLS.CCR.BUFFALO.EDU' while getting 
initial credentials

Should I be concerned that this will end up corrupting the other replicas? 
Should we just let this finish?

We have 3 replicas in our system, and it looks like we just lost a second one.
This feels very similar to the error we hit a while back:

https://www.redhat.com/archives/freeipa-users/2015-September/msg6.html

We're seeing the exact same behavior; the access logs are filling up with:

[09/Mar/2016:10:26:03 -0500] conn=6877203 fd=4003 slot=4003 connection from 
10.113.14.131 to 10.113.14.131
[09/Mar/2016:10:26:03 -0500] conn=6877204 fd=4004 slot=4004 connection from 
10.116.28.10 to 10.113.14.131
[09/Mar/2016:10:26:09 -0500] conn=6877205 fd=4005 slot=4005 connection from 
10.113.14.131 to 10.113.14.131
[09/Mar/2016:10:26:15 -0500] conn=6877206 fd=4006 slot=4006 connection from 
10.113.14.131 to 10.113.14.131
[09/Mar/2016:10:26:21 -0500] conn=6877207 fd=4007 slot=4007 connection from 
10.113.14.131 to 10.113.14.131
[09/Mar/2016:10:26:27 -0500] conn=6877208 fd=4008 slot=4008 connection from 
10.113.14.131 to 10.113.14.131
[09/Mar/2016:10:26:28 -0500] conn=6877209 fd=4009 slot=4009 connection from 
10.116.28.33 to 10.113.14.131
[09/Mar/2016:10:26:30 -0500] conn=6877210 fd=4010 slot=4010 connection from 
10.116.28.23 to 10.113.14.131
[09/Mar/2016:10:26:33 -0500] conn=6877211 fd=4011 slot=4011 connection from 
10.113.14.131 to 10.113.14.131

The ns-slapd process shows this in top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24951 dirsrv    20   0 15.477g 0.013t 6.067g S   0.0 27.3 101566:54 ns-slapd


I'd be happy to provide a pstack but can't seem to get the correct debuginfo
packages installed. We're running CentOS 7 and 389-ds-base 1.3.3.1; we haven't
upgraded to 1.3.4.0 yet. How can I get the debuginfo packages installed for
that specific version?
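
A hedged sketch of the usual CentOS 7 approach (assuming the debuginfo
repository is available; debuginfo-install should pick the installed version):

# yum install yum-utils gdb
# debuginfo-install 389-ds-base
# pstack $(pidof ns-slapd) > /tmp/ns-slapd.pstack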

Thanks!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


[Freeipa-users] ipa replica failed PR_DeleteSemaphore

2016-03-09 Thread Andrew E. Bruno
Hello,

We had a replica fail today with:

[09/Mar/2016:09:39:59 -0500] NSMMReplicationPlugin - changelog program - 
_cl5NewDBFile: PR_DeleteSemaphore: 
/var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/e909b405-2cb811e5-ac0b8f7e-e0b1a377.sema;
 NSPR error - -5943


dirsrv just hangs here and doesn't seem to want to start up. Any pointers on
how to debug this?

This is our first master, so we want to try to save it if possible.

Thanks,

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] nsslapd-dbcachesize and database size

2015-10-14 Thread Andrew E. Bruno
On Wed, Oct 14, 2015 at 07:59:23AM -0600, Rich Megginson wrote:
> On 10/14/2015 07:09 AM, Andrew E. Bruno wrote:
> >The load average on our freeipa replicas started to spike over the
> >last few days and we narrowed it down to a dbcache issue. Following the
> >guidelines here: https://github.com/richm/scripts/wiki/dbmon.sh
> >
> >We saw that the dbcachefree was 2.0% which indicates a lot of page
> >churn. Sure enough our nsslapd-dbcachesize was set to 2G and the size of
> >our database and index files was 3.1G:
> >
> >$ du -sh /var/lib/dirsrv/slapd-[domain]/db/
> >3.1G
> >
> >Once we increased nsslapd-dbcachesize to 6G load average went back to
> >normal and query response times improved. Interestingly, when we
> >restarted the dirsrv process the database size went down to 1.7G
> >
> >$ du -sh /var/lib/dirsrv/slapd-[domain]/db/
> >1.7G
> >
> >When we initially deployed freeipa, the size of our database and indexes
> >was about 400M which is why we set nsslapd-dbcachesize to 2G.
> 
> What about your cachememsize?

We have nsslapd-cachememsize set at 2G for the cn=userRoot. According to
dbmon.sh it looked OK and we didn't think it needed to be increased:

dbcachefree 6123847680 free% 95.055 roevicts 0 hit% 99 pagein 34260 pageout 
308661

   dbname  count  free  free%size
changelog:ent 84158342   30.9  4210.2
changelog:dn   29716580.074.0
 userroot:ent   84462091021433   97.4  6685.1
 userroot:dn8446 523272268   99.8   120.3
ipaca:ent100673516   47.9  7338.6
ipaca:dn 100   1399359   99.480.1


The changelog:dn cache seems to vary so much that we're not sure how to tune
that one; it's almost always 0% free. Any suggestions?
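
For changelog:dn, the knob would presumably be the DN cache on that backend
(nsslapd-dncachememsize); a sketch with an illustrative value, assuming the
retro changelog backend entry:

$ ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=changelog,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dncachememsize
nsslapd-dncachememsize: 104857600
EOF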

> 
> >
> >A few questions:
> >
> >1. What causes the increase in size of
> >/var/lib/dirsrv/slapd-[domain]/db/*  and should we periodically clean up?
> 
> Replication metadata accounts for some of this.  Fragmentation accounts for
> some of this.  You can periodically clean up, but you shouldn't have to.
> The growth should eventually hit a plateau.
> 
> >
> >2. How do you tune nsslapd-dbcachesize to account for this growth? The
> >dbmon.sh wiki suggests a 12% overhead but our db files and indexes seem
> >to grow much larger?
> 
> 12% is sort of a starting point.  There isn't a good way to tell how to
> account for replication metadata, fragmentation, etc.  Just monitor
> periodically and adjust as needed.

Great, Thanks Rich!

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


[Freeipa-users] nsslapd-dbcachesize and database size

2015-10-14 Thread Andrew E. Bruno
The load average on our freeipa replicas started to spike over the
last few days and we narrowed it down to a dbcache issue. Following the
guidelines here: https://github.com/richm/scripts/wiki/dbmon.sh

We saw that the dbcachefree was 2.0%, which indicates a lot of page
churn. Sure enough, our nsslapd-dbcachesize was set to 2G while the size of
our database and index files was 3.1G:

$ du -sh /var/lib/dirsrv/slapd-[domain]/db/
3.1G

Once we increased nsslapd-dbcachesize to 6G, the load average went back to
normal and query response times improved. Interestingly, when we
restarted the dirsrv process, the database size went down to 1.7G:

$ du -sh /var/lib/dirsrv/slapd-[domain]/db/
1.7G

When we initially deployed freeipa, the size of our database and indexes
was about 400M, which is why we set nsslapd-dbcachesize to 2G.
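
For reference, the increase itself was a single attribute change on the ldbm
config entry; roughly (assuming the standard DN, with dirsrv restarted
afterwards):

$ ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dbcachesize
nsslapd-dbcachesize: 6442450944
EOF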

A few questions:

1. What causes the increase in size of
/var/lib/dirsrv/slapd-[domain]/db/*  and should we periodically clean up?

2. How do you tune nsslapd-dbcachesize to account for this growth? The
dbmon.sh wiki suggests a 12% overhead, but our db files and indexes seem
to grow much larger than that.

We're running: 
- ipa-server-4.1.0-18.el7.centos.4.x86_64 and
- 389-ds-base-1.3.3.1-20.el7_1.x86_64

Thanks,

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] re-initialize replica

2015-10-06 Thread Andrew E. Bruno
On Tue, Oct 06, 2015 at 10:22:44AM -0400, Rob Crittenden wrote:
> Andrew E. Bruno wrote:
> > On Tue, Oct 06, 2015 at 09:35:08AM -0400, Rob Crittenden wrote:
> >> Andrew E. Bruno wrote:
> >>> The replica is not showing up when running ipa-replica-manage list.
> >>>
> >>>   # ipa-replica-manage list
> >>>   srv-m14-32.cbls.ccr.buffalo.edu: master
> >>>   srv-m14-31-02.cbls.ccr.buffalo.edu: master
> >>>
> >>>
> >>> However, still seeing the ruvs in ldapsearch:
> >>>
> >>> ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >>> objectClass=nsDS5ReplicationAgreement -LL
> >>>
> >>>
> >>> nsds50ruv: {replica 5 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>> 55afec6b
> >>>  0005 55b2aa6800020005
> >>>
> >>>
> >>> ..
> >>>
> >>> nsds50ruv: {replica 91 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>> 55afecb
> >>>  0005b 55b13e74005b
> >>>
> >>>
> >>> Should I clean these manually? or can I run: ipa-replica-manage clean-ruv 
> >>> 5
> >>>
> >>> Thanks again for the all the help.
> >>>
> >>> --Andrew
> >>>
> >>>
> >>
> >> Note that the list of masters comes from entries in IPA, not from
> >> replication agreements.
> >>
> >> ipa-replica-manage list-ruv will show the RUV data in a simpler way.
> >>
> >> Yeah, I'd use clean-ruv to clean them up.
> >>
> >> rob
> >>
> >>
> > 
> > I get an error trying to clean-ruv:
> > 
> >   # ipa-replica-manage clean-ruv 5
> >   Replica ID 5 not found
> > 
> > Can these safely be ignored? or will we hit problems when adding the
> > replica back in?
> 
> ipa-replica-manage list-ruv will show you the current RUV list. If it
> isn't there then yeah, you're done.
> 
> Having unused RUV in a master causes it to do unnecessary replication
> calculations.
> 
> rob

Yes, list-ruv seems to show the correct RUV list. 

# ipa-replica-manage list-ruv
srv-m14-32.cbls.ccr.buffalo.edu:389: 4
srv-m14-31-02.cbls.ccr.buffalo.edu:389: 3

It's just the ldapsearch that's showing replica ID 5:

ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=meTosrv-m14-31-02.cbls.ccr.buffalo.edu,cn=replica,cn=dc\3Dcbls\2Cdc\3Dc
 cr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
cn: meTosrv-m14-31-02.cbls.ccr.buffalo.edu
objectClass: nsds5replicationagreement
objectClass: top
..
nsds50ruv: {replica 5 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 55afec6b
 0005 55b2aa6800020005



dn: cn=masterAgreement1-srv-m14-31-02.cbls.ccr.buffalo.edu-pki-tomcat,cn=repli
 ca,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
...
nsds50ruv: {replica 91 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 55afecb
 0005b 55b13e74005b



Last time we had a replica fail, we manually ran a cleanallruv via ldapmodify
for the ipaca RID, which wasn't properly deleted. However, this time we're
seeing the RID in both the ipaca DN and the replica DN?

Just want to be sure: are you saying these can be safely ignored, and that we
can trust that list-ruv is correct (and not causing unnecessary replication
calculations)? We plan on adding the failed replica back with the same name.
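
For reference, the manual cleanallruv we ran last time was a task entry added
via ldapmodify; roughly, with illustrative values for the ipaca backend and
RID:

$ ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=clean 91,cn=cleanallruv,cn=tasks,cn=config
changetype: add
objectclass: extensibleObject
cn: clean 91
replica-base-dn: o=ipaca
replica-id: 91
replica-force-cleaning: no
EOF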

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] re-initialize replica

2015-10-06 Thread Andrew E. Bruno
On Tue, Oct 06, 2015 at 02:29:49PM -0400, Mark Reynolds wrote:
> 
> 
> On 10/06/2015 01:13 PM, Andrew E. Bruno wrote:
> >On Tue, Oct 06, 2015 at 12:53:04PM -0400, Mark Reynolds wrote:
> >>
> >>On 10/06/2015 10:30 AM, Andrew E. Bruno wrote:
> >>>On Tue, Oct 06, 2015 at 10:22:44AM -0400, Rob Crittenden wrote:
> >>>>Andrew E. Bruno wrote:
> >>>>>On Tue, Oct 06, 2015 at 09:35:08AM -0400, Rob Crittenden wrote:
> >>>>>>Andrew E. Bruno wrote:
> >>>>>>>The replica is not showing up when running ipa-replica-manage list.
> >>>>>>>
> >>>>>>>   # ipa-replica-manage list
> >>>>>>>   srv-m14-32.cbls.ccr.buffalo.edu: master
> >>>>>>>   srv-m14-31-02.cbls.ccr.buffalo.edu: master
> >>>>>>>
> >>>>>>>
> >>>>>>>However, still seeing the ruvs in ldapsearch:
> >>>>>>>
> >>>>>>>ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >>>>>>>objectClass=nsDS5ReplicationAgreement -LL
> >>>>>>>
> >>>>>>>
> >>>>>>>nsds50ruv: {replica 5 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>>>>>>55afec6b
> >>>>>>>  0005 55b2aa6800020005
> >>>>>>>
> >>>>>>>
> >>>>>>>..
> >>>>>>>
> >>>>>>>nsds50ruv: {replica 91 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>>>>>>55afecb
> >>>>>>>  0005b 55b13e74005b
> >>>>>>>
> >>>>>>>
> >>>>>>>Should I clean these manually? or can I run: ipa-replica-manage 
> >>>>>>>clean-ruv 5
> >>>>>>>
> >>>>>>>Thanks again for the all the help.
> >>>>>>>
> >>>>>>>--Andrew
> >>>>>>>
> >>>>>>>
> >>>>>>Note that the list of masters comes from entries in IPA, not from
> >>>>>>replication agreements.
> >>>>>>
> >>>>>>ipa-replica-manage list-ruv will show the RUV data in a simpler way.
> >>>>>>
> >>>>>>Yeah, I'd use clean-ruv to clean them up.
> >>>>>>
> >>>>>>rob
> >>>>>>
> >>>>>>
> >>>>>I get an error trying to clean-ruv:
> >>>>>
> >>>>>   # ipa-replica-manage clean-ruv 5
> >>>>>   Replica ID 5 not found
> >>>>>
> >>>>>Can these safely be ignored? or will we hit problems when adding the
> >>>>>replica back in?
> >>>>ipa-replica-manage list-ruv will show you the current RUV list. If it
> >>>>isn't there then yeah, you're done.
> >>>>
> >>>>Having unused RUV in a master causes it to do unnecessary replication
> >>>>calculations.
> >>>>
> >>>>rob
> >>>Yes, list-ruv seems to show the correct RUV list.
> >>>
> >>># ipa-replica-manage list-ruv
> >>>srv-m14-32.cbls.ccr.buffalo.edu:389: 4
> >>>srv-m14-31-02.cbls.ccr.buffalo.edu:389: 3
> >>>
> >>>It's just the ldapsearch that's showing repid 5 :
> >>>
> >>>ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >>>objectClass=nsDS5ReplicationAgreement -LL
> >>I think this can be ignored sicne its on the repl agreement, and not the
> >>backend.
> >>
> >>What does this ldapsearch return:
> >>
> >>replace -b with your suffix
> >>
> >>ldapsearch -Y GSSAPI -b|"dc=example,dc=com" 
> >>'(&(nsuniqueid=---)(objectclass=nstombstone))'
> >>nsds50ruv|
> >>
> >>
> >>Mark
> >
> >Here's the results of the above query:
> >
> >
> >dn: cn=replica,cn=dc\3Dcbls\2Cdc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping 
> >tr
> >  ee,cn=config
> >nsds50ruv: {replicageneration} 55a955910004
> >nsds50ruv: {replica 4 ldap://srv-m14-32.cbls.ccr.buffalo.edu:389} 
> >55a955fa
> >  0004 561400b300070004
> >nsds50ruv: {replica 3 ldap://srv-m14-31-02.cbls.ccr.buffalo.edu:389} 
> >55a955960
> >  003 5613f7b500020003
> >nsds50ruv: {replica 5} 5600051d0015 5600051d0015
> >
> >
> >Still see that replica 5? Is that normal?
> It's still present, and if you were having replication issues its possible
> the changelog recreated that old replica ID (replica 5) after it was
> cleaned.  This changelog resurrection bug has been fixed upstream- fyi.
> 
> So, you need to rerun cleanallruv.  If the IPA CLI is not detecting the
> replica id you are trying to delete, then you can run the 389-ds-base
> cleanallruv.pl script and run it on the server with the old rid:
> 
> cleanallruv.pl -D "cn=directory manager"  -w password -b
> "dc=cbls,dc=ccr,dc=buffalo,dc=edu" -r 5
> 
> Wait a minute, and rerun that ldapsearch to see if the replica ID was
> removed/cleaned.
> 
> Mark

Worked like a charm. cleanallruv.pl removed the replica ID and I
verified it was gone via the ldapsearch.

Thanks so much. Appreciate all the help.

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] re-initialize replica

2015-10-06 Thread Andrew E. Bruno
On Mon, Oct 05, 2015 at 02:48:48PM -0400, Rob Crittenden wrote:
> Andrew E. Bruno wrote:
> > On Mon, Oct 05, 2015 at 12:40:42PM +0200, Martin Kosek wrote:
> >> On 10/02/2015 06:00 PM, Andrew E. Bruno wrote:
> >>> On Fri, Oct 02, 2015 at 09:56:47AM -0400, Andrew E. Bruno wrote:
> >>>> What's the best way to re-initialize a replica? 
> >>>>
> >>>> Suppose one of your replicas goes south.. is there a command to tell
> >>>> that replicate to re-initialize from the first master (instead of
> >>>> removing/re-adding the replica from the topology)?
> >>>
> >>> Found the command I was looking for:
> >>>ipa-replica-manage re-initialize --from xxx
> >>>
> >>> However, one of our replicates is down and can't seem to re-initialize
> >>> it. Starting ipa fails (via systemctl restart ipa):
> >>>
> >>> ipactl status
> >>> Directory Service: RUNNING
> >>> krb5kdc Service: STOPPED
> >>> kadmin Service: STOPPED
> >>> named Service: STOPPED
> >>> ipa_memcached Service: STOPPED
> >>> httpd Service: STOPPED
> >>> pki-tomcatd Service: STOPPED
> >>> ipa-otpd Service: STOPPED
> >>> ipa: INFO: The ipactl command was successful
> >>>
> >>>
> >>> Errors from the dirsrv show:
> >>>
> >>> : GSSAPI Error: Unspecified GSS failure.  Minor code may provide more 
> >>> information (No Kerberos credentials available)) errno 0 (Success)
> >>> [02/Oct/2015:11:45:05 -0400] slapi_ldap_bind - Error: could not perform 
> >>> interactive bind for id [] authentication mechanism [GSSAPI]: error -2 
> >>> (Local error)
> >>> [02/Oct/2015:11:50:05 -0400] set_krb5_creds - Could not get initial 
> >>> credentials for principal [ldap/server@realm] in keytab 
> >>> [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
> >>> requested realm)
> >>> [02/Oct/2015:11:50:05 -0400] slapd_ldap_sasl_interactive_bind - Error: 
> >>> could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 
> >>> (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS 
> >>> failure.  Minor code may provide more information (No Kerberos 
> >>> credentials available)) errno 0 (Success)
> >>> [02/Oct/2015:11:50:05 -0400] slapi_ldap_bind - Error: could not perform 
> >>> interactive bind for id [] authentication mechanism [GSSAPI]: error -2 
> >>> (Local error)
> >>>
> >>>
> >>> Attempting to re-initialize fails:
> >>>
> >>> ipa-replica-manage re-initialize --from master
> >>> Connection timed out.
> >>>
> >>>
> >>> I verified time is in sync and DNS forward/reverse resolution is working.
> >>>
> >>> Any pointers on what else to try?
> >>>
> >>> Thanks!
> >>>
> >>> --Andrew
> >>
> >> Given that your Kerberos server instance is down, I would start 
> >> investigating
> >> Kerberos logs to see why.
> > 
> > 
> > So looks like the dirsrv service comes up but with GSS errors about kerb
> > credentials. However, the rest of the services including the krb5kdc
> > fail to come up. Errors from the kdc logs suggest DNS:
> 
> DS complaining about GSS is somewhat normal during startup as it is a
> bit noisy. The other errors suggest there is no data in the backend. An
> ldapsearch would confirm that.
> 
> > 
> >  LOOKING_UP_CLIENT: DNS/replica@REALM Server error
> > 
> > FreeIPA is configured to serve DNS and this replica resolves it's own
> > DNS in /etc/resolv.conf (127.0.0.1)
> > 
> > I tried pointing /etc/resolv.conf to another (good) replica and even
> > tried adjusting /etc/krb5.conf to point at another kdc to try and get a
> > ticket however it still tries to connect to the local kdc (which fails
> > to start).  
> > 
> > I'm inclined to re-install this replica and start fresh. I'm curious if
> > we can re-kickstart this host from a fresh os/freeipa install and run
> > the  ipa-replica-manage re-initialize --from master command. The replica
> > will have the same name.. is this possible? Would we need to backup the
> > /var/lib/ipa/replica-info-XXX.gpg file?
> 
> It needs to have its own principal in order to re-initialize. It sounds
> like it has nothing which is why replication is failing.
> 
> I'd recommend gen

Re: [Freeipa-users] re-initialize replica

2015-10-06 Thread Andrew E. Bruno
On Tue, Oct 06, 2015 at 12:53:04PM -0400, Mark Reynolds wrote:
> 
> 
> On 10/06/2015 10:30 AM, Andrew E. Bruno wrote:
> >On Tue, Oct 06, 2015 at 10:22:44AM -0400, Rob Crittenden wrote:
> >>Andrew E. Bruno wrote:
> >>>On Tue, Oct 06, 2015 at 09:35:08AM -0400, Rob Crittenden wrote:
> >>>>Andrew E. Bruno wrote:
> >>>>>The replica is not showing up when running ipa-replica-manage list.
> >>>>>
> >>>>>   # ipa-replica-manage list
> >>>>>   srv-m14-32.cbls.ccr.buffalo.edu: master
> >>>>>   srv-m14-31-02.cbls.ccr.buffalo.edu: master
> >>>>>
> >>>>>
> >>>>>However, still seeing the ruvs in ldapsearch:
> >>>>>
> >>>>>ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >>>>>objectClass=nsDS5ReplicationAgreement -LL
> >>>>>
> >>>>>
> >>>>>nsds50ruv: {replica 5 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>>>>55afec6b
> >>>>>  0005 55b2aa6800020005
> >>>>>
> >>>>>
> >>>>>..
> >>>>>
> >>>>>nsds50ruv: {replica 91 ldap://srv-m14-30.cbls.ccr.buffalo.edu:389} 
> >>>>>55afecb
> >>>>>  0005b 55b13e74005b
> >>>>>
> >>>>>
> >>>>>Should I clean these manually? or can I run: ipa-replica-manage 
> >>>>>clean-ruv 5
> >>>>>
> >>>>>Thanks again for the all the help.
> >>>>>
> >>>>>--Andrew
> >>>>>
> >>>>>
> >>>>Note that the list of masters comes from entries in IPA, not from
> >>>>replication agreements.
> >>>>
> >>>>ipa-replica-manage list-ruv will show the RUV data in a simpler way.
> >>>>
> >>>>Yeah, I'd use clean-ruv to clean them up.
> >>>>
> >>>>rob
> >>>>
> >>>>
> >>>I get an error trying to clean-ruv:
> >>>
> >>>   # ipa-replica-manage clean-ruv 5
> >>>   Replica ID 5 not found
> >>>
> >>>Can these safely be ignored? or will we hit problems when adding the
> >>>replica back in?
> >>ipa-replica-manage list-ruv will show you the current RUV list. If it
> >>isn't there then yeah, you're done.
> >>
> >>Having unused RUV in a master causes it to do unnecessary replication
> >>calculations.
> >>
> >>rob
> >Yes, list-ruv seems to show the correct RUV list.
> >
> ># ipa-replica-manage list-ruv
> >srv-m14-32.cbls.ccr.buffalo.edu:389: 4
> >srv-m14-31-02.cbls.ccr.buffalo.edu:389: 3
> >
> >It's just the ldapsearch that's showing repid 5 :
> >
> >ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" 
> >objectClass=nsDS5ReplicationAgreement -LL
> I think this can be ignored since it's on the replication agreement, and not the
> backend.
> 
> What does this ldapsearch return:
> 
> replace -b with your suffix
> 
> ldapsearch -Y GSSAPI -b "dc=example,dc=com" 
> '(&(nsuniqueid=---)(objectclass=nstombstone))' nsds50ruv
> 
> 
> Mark


Here's the results of the above query:


dn: cn=replica,cn=dc\3Dcbls\2Cdc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tr
 ee,cn=config
nsds50ruv: {replicageneration} 55a955910004
nsds50ruv: {replica 4 ldap://srv-m14-32.cbls.ccr.buffalo.edu:389} 55a955fa
 0004 561400b300070004
nsds50ruv: {replica 3 ldap://srv-m14-31-02.cbls.ccr.buffalo.edu:389} 55a955960
 003 5613f7b500020003
nsds50ruv: {replica 5} 5600051d0015 5600051d0015


Still see that replica 5? Is that normal?
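
If it does need to go, I'm assuming a raw CLEANALLRUV task would work here too,
pointed at the domain backend instead of o=ipaca (just a sketch using our
suffix and the stale replica ID; corrections welcome):

# ldapmodify -D "cn=directory manager" -W -a
dn: cn=clean 5, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: dc=cbls,dc=ccr,dc=buffalo,dc=edu
replica-id: 5
cn: clean 5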

Thanks!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] re-initialize replica

2015-10-05 Thread Andrew E. Bruno
On Mon, Oct 05, 2015 at 12:40:42PM +0200, Martin Kosek wrote:
> On 10/02/2015 06:00 PM, Andrew E. Bruno wrote:
> > On Fri, Oct 02, 2015 at 09:56:47AM -0400, Andrew E. Bruno wrote:
> >> What's the best way to re-initialize a replica? 
> >>
> >> Suppose one of your replicas goes south.. is there a command to tell
> >> that replica to re-initialize from the first master (instead of
> >> removing/re-adding the replica from the topology)?
> > 
> > Found the command I was looking for:
> >ipa-replica-manage re-initialize --from xxx
> > 
> > However, one of our replicas is down and we can't seem to re-initialize
> > it. Starting ipa fails (via systemctl restart ipa):
> > 
> > ipactl status
> > Directory Service: RUNNING
> > krb5kdc Service: STOPPED
> > kadmin Service: STOPPED
> > named Service: STOPPED
> > ipa_memcached Service: STOPPED
> > httpd Service: STOPPED
> > pki-tomcatd Service: STOPPED
> > ipa-otpd Service: STOPPED
> > ipa: INFO: The ipactl command was successful
> > 
> > 
> > Errors from the dirsrv show:
> > 
> > : GSSAPI Error: Unspecified GSS failure.  Minor code may provide more 
> > information (No Kerberos credentials available)) errno 0 (Success)
> > [02/Oct/2015:11:45:05 -0400] slapi_ldap_bind - Error: could not perform 
> > interactive bind for id [] authentication mechanism [GSSAPI]: error -2 
> > (Local error)
> > [02/Oct/2015:11:50:05 -0400] set_krb5_creds - Could not get initial 
> > credentials for principal [ldap/server@realm] in keytab 
> > [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
> > requested realm)
> > [02/Oct/2015:11:50:05 -0400] slapd_ldap_sasl_interactive_bind - Error: 
> > could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 
> > (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS 
> > failure.  Minor code may provide more information (No Kerberos credentials 
> > available)) errno 0 (Success)
> > [02/Oct/2015:11:50:05 -0400] slapi_ldap_bind - Error: could not perform 
> > interactive bind for id [] authentication mechanism [GSSAPI]: error -2 
> > (Local error)
> > 
> > 
> > Attempting to re-initialize fails:
> > 
> > ipa-replica-manage re-initialize --from master
> > Connection timed out.
> > 
> > 
> > I verified time is in sync and DNS forward/reverse resolution is working.
> > 
> > Any pointers on what else to try?
> > 
> > Thanks!
> > 
> > --Andrew
> 
> Given that your Kerberos server instance is down, I would start investigating
> Kerberos logs to see why.


So looks like the dirsrv service comes up but with GSS errors about kerb
credentials. However, the rest of the services including the krb5kdc
fail to come up. Errors from the kdc logs suggest DNS:

 LOOKING_UP_CLIENT: DNS/replica@REALM Server error

FreeIPA is configured to serve DNS and this replica resolves its own
DNS in /etc/resolv.conf (127.0.0.1)

I tried pointing /etc/resolv.conf to another (good) replica and even
tried adjusting /etc/krb5.conf to point at another kdc to try and get a
ticket however it still tries to connect to the local kdc (which fails
to start).  
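
For reference, this is roughly how I've been testing the keytab/KDC path by
hand (sketch; /tmp/krb5-test.conf is just a copy of /etc/krb5.conf with the
kdc line pointed at one of the good replicas):

# try to get the DS ticket straight from the keytab, with KDC tracing
KRB5_TRACE=/dev/stderr kinit -kt /etc/dirsrv/ds.keytab ldap/$(hostname -f)
klist

# same test, but forcing a known-good KDC via a throwaway krb5.conf
KRB5_CONFIG=/tmp/krb5-test.conf KRB5_TRACE=/dev/stderr \
    kinit -kt /etc/dirsrv/ds.keytab ldap/$(hostname -f)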

I'm inclined to re-install this replica and start fresh. I'm curious if
we can re-kickstart this host from a fresh os/freeipa install and run
the  ipa-replica-manage re-initialize --from master command. The replica
will have the same name.. is this possible? Would we need to backup the
/var/lib/ipa/replica-info-XXX.gpg file?


--Andrew



-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


[Freeipa-users] re-initialize replica

2015-10-02 Thread Andrew E. Bruno
What's the best way to re-initialize a replica? 

Suppose one of your replicas goes south.. is there a command to tell
that replica to re-initialize from the first master (instead of
removing/re-adding the replica from the topology)?


Thanks,

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] re-initialize replica

2015-10-02 Thread Andrew E. Bruno
On Fri, Oct 02, 2015 at 09:56:47AM -0400, Andrew E. Bruno wrote:
> What's the best way to re-initialize a replica? 
> 
> Suppose one of your replicas goes south.. is there a command to tell
> that replica to re-initialize from the first master (instead of
> removing/re-adding the replica from the topology)?

Found the command I was looking for:
   ipa-replica-manage re-initialize --from xxx

However, one of our replicas is down and we can't seem to re-initialize
it. Starting ipa fails (via systemctl restart ipa):

ipactl status
Directory Service: RUNNING
krb5kdc Service: STOPPED
kadmin Service: STOPPED
named Service: STOPPED
ipa_memcached Service: STOPPED
httpd Service: STOPPED
pki-tomcatd Service: STOPPED
ipa-otpd Service: STOPPED
ipa: INFO: The ipactl command was successful


Errors from the dirsrv show:

: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more 
information (No Kerberos credentials available)) errno 0 (Success)
[02/Oct/2015:11:45:05 -0400] slapi_ldap_bind - Error: could not perform 
interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local 
error)
[02/Oct/2015:11:50:05 -0400] set_krb5_creds - Could not get initial credentials 
for principal [ldap/server@realm] in keytab [FILE:/etc/dirsrv/ds.keytab]: 
-1765328228 (Cannot contact any KDC for requested realm)
[02/Oct/2015:11:50:05 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local 
error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  
Minor code may provide more information (No Kerberos credentials available)) 
errno 0 (Success)
[02/Oct/2015:11:50:05 -0400] slapi_ldap_bind - Error: could not perform 
interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local 
error)


Attempting to re-initialize fails:

ipa-replica-manage re-initialize --from master
Connection timed out.


I verified time is in sync and DNS forward/reverse resolution is working.

Any pointers on what else to try?

Thanks!

--Andrew


-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


[Freeipa-users] ipa upgrade failed

2015-10-01 Thread Andrew E. Bruno
Running CentOS 7.1.1503.

Upgrading via yum update from:  

  ipa-server.x86_64 0:4.1.0-18.el7.centos.3

  --to--

  ipa-server.x86_64 0:4.1.0-18.el7.centos.4
  

We have 3 replicas. Upgrading the first replica (first-master) went
fine. Upgrading the second replica failed.

Got the following error during the update on the second replicate:

Pre schema upgrade failed with [Errno 111] Connection refused
IPA upgrade failed.

Where should I look for more info? Looked in the usual places and didn't
see anything out of the ordinary. How can I fix/verify the upgrade?
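
For what it's worth, here's roughly what I've been checking so far (sketch;
the dirsrv instance name below is ours, adjust as needed):

grep -iE 'error|failed' /var/log/ipaupgrade.log | tail -50
systemctl status dirsrv@INT-CCR-BUFFALO-EDU.service
ss -ltnp | grep ':389'     # is ns-slapd actually listening?
ldapsearch -x -H ldap://localhost -s base -b "" vendorVersion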

Thanks!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa upgrade failed

2015-10-01 Thread Andrew E. Bruno
On Thu, Oct 01, 2015 at 05:40:34PM +0200, Martin Basti wrote:
> 
> 
> On 10/01/2015 05:28 PM, Andrew E. Bruno wrote:
> >On Thu, Oct 01, 2015 at 05:09:23PM +0200, Martin Basti wrote:
> >>
> >>On 10/01/2015 05:03 PM, Andrew E. Bruno wrote:
> >>>Running CentOS 7.1.1503.
> >>>
> >>>Upgrading via yum update from:
> >>>
> >>>   ipa-server.x86_64 0:4.1.0-18.el7.centos.3
> >>>
> >>>   --to--
> >>>
> >>>   ipa-server.x86_64 0:4.1.0-18.el7.centos.4
> >>>
> >>>We have 3 replicas. Upgrading the first replica (first-master) went
> >>>fine. Upgrading the second replica failed.
> >>>
> >>>Got the following error during the update on the second replicate:
> >>>
> >>>Pre schema upgrade failed with [Errno 111] Connection refused
> >>>IPA upgrade failed.
> >>>
> >>>Where should I look for more info? Looked in the usual places and didn't
> >>>see anything out of the ordinary. How can I fix/verify the upgrade?
> >>>
> >>>Thanks!
> >>>
> >>>--Andrew
> >>>
> >>Hello,
> >>
> >>can you check /var/log/ipaupgrade.log and /var/log/dirsrv/slapd-*/errors for
> >>more specific errors?
> >Here's the errors from /var/log/dirsrv/slapd-*/errors right around the time 
> >of the upgrade:
> >
> >[01/Oct/2015:10:42:37 -0400] - slapd shutting down - signaling operation 
> >threads - op stack size 84 max work q size 59 max work q stack size 59
> >[01/Oct/2015:10:44:50 -0400] - Information: Non-Secure Port Disabled
> >[01/Oct/2015:10:44:50 -0400] - 389-Directory/1.3.3.1 B2015.218.023 starting 
> >up
> >[01/Oct/2015:10:44:50 -0400] - WARNING: changelog: entry cache size 512000B 
> >is less than db size 28639232B; We recommend to increase the entry cache 
> >size nsslapd-cachememsize.
> >[01/Oct/2015:10:44:50 -0400] - Detected Disorderly Shutdown last time 
> >Directory Server was running, recovering database.
> >...
> >tion.
> >[01/Oct/2015:10:45:03 -0400] NSMMReplicationPlugin - changelog program - 
> >_cl5NewDBFile: PR_DeleteSemaphore: 
> >/var/lib/dirsrv/slapd-INT-CCR-BUFFALO-EDU/cldb/9fc73292-2b0911e5-a53bd3a1-f3bbc58c.sema;
> > NSPR error - -5943
> >[01/Oct/2015:10:45:08 -0400] NSMMReplicationPlugin - changelog program - 
> >_cl5NewDBFile: PR_DeleteSemaphore: 
> >/var/lib/dirsrv/slapd-INT-CCR-BUFFALO-EDU/cldb/d0a7671e-2b0911e5-a53bd3a1-f3bbc58c.sema;
> > NSPR error - -5943
> >[01/Oct/2015:10:45:08 -0400] NSMMReplicationPlugin - 
> >replica_check_for_data_reload: Warning: disordely shutdown for replica 
> >o=ipaca. Check if DB RUV needs to be updated
> >[01/Oct/2015:10:45:08 -0400] set_krb5_creds - Could not get initial 
> >credentials for principal 
> >[ldap/srv-d13-39-02.int.ccr.buffalo@int.ccr.buffalo.edu] in keytab 
> >[FILE:/etc/dirsrv/ds.keytab]: -1765328324 (Generic error (see e-text))
> >[01/Oct/2015:10:45:08 -0400] NSMMReplicationPlugin - 
> >agmt="cn=cloneAgreement1-srv-d13-39-02.int.ccr.buffalo.edu-pki-tomcat" 
> >(srv-d13-38-02:389): Unable to acquire replica: the replica instructed us to 
> >go into backoff mode. Will retry later.
> >[01/Oct/2015:10:45:08 -0400] NSMMReplicationPlugin - 
> >replica_check_for_data_reload: Warning: disordely shutdown for replica 
> >dc=int,dc=ccr,dc=buffalo,dc=edu. Check if DB RUV needs to be updated
> >[01/Oct/2015:10:45:08 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
> >not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local 
> >error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  
> >Minor code may provide more information (No Kerberos credentials available)) 
> >errno 0 (Success)
> >...
> >[01/Oct/2015:10:45:09 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 22231 (rc: 32)
> >[01/Oct/2015:10:45:09 -0400] - slapd started.  Listening on 
> >/var/run/slapd-INT-CCR-BUFFALO-EDU.socket for LDAPI requests
> >[01/Oct/2015:10:45:09 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 22232 (rc: 32)
> >[01/Oct/2015:10:45:09 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 22233 (rc: 32)
> >[01/Oct/2015:10:45:09 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 22234 (rc: 32)
> >[01/Oct/2015:10:45:09 -0400] DSRetroclPlugin - delete_changerecord: could 
> >not delete change record 22235 (rc: 32)
> >[01/Oct/2015:10:45:09 -

[Freeipa-users] replicas unresponsive with increasing file descriptors

2015-09-01 Thread Andrew E. Bruno
A few months ago we had a replica failure where the system ran out of file
descriptors and the slapd database was corrupted:

https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html

We now monitor file descriptor counts on our replicas and last night we
had 2 of our 3 replicas fail and become completely unresponsive. Trying
to kinit on the replica resulted in:

[user@ipa-master]$ kinit
kinit: Generic error (see e-text) while getting initial credentials


Snippet from the /var/log/dirsrv/slapd-[domain]/errors:

[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: 
Attempting to release replica, but unable to receive endReplication extended 
operation response from the replica. Error -5 (Timed out)
[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental 
protocol: event backoff_timer_expired should not occur in state start_backoff
[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.
[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - 
agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
receive the response for a startReplication extended operation to consumer 
(Timed out). Will retry later.

The access logs were filling up with:

[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 
10.106.14.29 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 
10.104.9.137 to 10.113.14.30
[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 
10.104.16.19 to 10.113.14.30
[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 
10.111.11.30 to 10.113.14.30
[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 
10.113.27.115 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 
10.111.8.116 to 10.113.14.30
[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 
10.113.25.40 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 
10.106.14.27 to 10.113.14.30
[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 
10.111.10.141 to 10.113.14.30
[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 
10.104.14.27 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 
10.106.13.132 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 
10.113.25.11 to 10.113.14.30
[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 
10.104.15.11 to 10.113.14.30
[31/Aug/2015:17:13:32 -0400] conn=1386003 fd=533 slot=533 connection from 
10.104.7.136 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386004 fd=534 slot=534 connection from 
10.113.24.23 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386005 fd=535 slot=535 connection from 
10.106.12.105 to 10.113.14.30
[31/Aug/2015:17:13:33 -0400] conn=1386006 fd=536 slot=536 connection from 
10.104.16.41 to 10.113.14.30
[31/Aug/2015:17:13:34 -0400] conn=1386007 fd=537 slot=537 connection from 
10.104.16.4 to 10.113.14.30
[31/Aug/2015:17:13:35 -0400] conn=1386008 fd=538 slot=538 connection from 
10.111.8.12 to 10.113.14.30
[31/Aug/2015:17:13:36 -0400] conn=1386009 fd=539 slot=539 connection from 
10.111.8.17 to 10.113.14.30



Seems like clients were connecting to the replicas but file descriptors were
not getting released. Our monitoring showed increasing file descriptor counts
on both replicas (the FD counts are normally ~600):
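
The check we run from cron is nothing fancy (sketch; slapd-[domain] is the
usual placeholder, and 2000 is an arbitrary alert threshold well above our
normal ~600):

PID=$(cat /var/run/dirsrv/slapd-[domain].pid)
NFD=$(ls /proc/$PID/fd | wc -l)
echo "$(date) ns-slapd open fds: $NFD"
[ "$NFD" -gt 2000 ] && echo "WARNING: ns-slapd fd count is $NFD"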


Re: [Freeipa-users] replicas unresponsive with increasing file descriptors

2015-09-01 Thread Andrew E. Bruno
On Tue, Sep 01, 2015 at 05:03:10PM +0200, Ludwig Krispenz wrote:
> 
> On 09/01/2015 04:39 PM, Andrew E. Bruno wrote:
> >A few months ago we had a replica failure where the system ran out of file
> >descriptors and the slapd database was corrupted:
> >
> >https://www.redhat.com/archives/freeipa-users/2015-June/msg00389.html
> >
> >We now monitor file descriptor counts on our replicas and last night we
> >had 2 of our 3 replicas fail and become completely unresponsive. Trying
> >to kinit on the replica resulted in:
> >
> >[user@ipa-master]$ kinit
> >kinit: Generic error (see e-text) while getting initial credentials
> >
> >
> >Snippet from the /var/log/dirsrv/slapd-[domain]/errors:
> >
> >[31/Aug/2015:17:14:39 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Warning: 
> >Attempting to release replica, but unable to receive endReplication extended 
> >operation response from the replica. Error -5 (Timed out)
> >[31/Aug/2015:17:16:39 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:18:42 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:20:42 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:22:47 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:24:47 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Incremental 
> >protocol: event backoff_timer_expired should not occur in state start_backoff
> >[31/Aug/2015:17:26:50 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >[31/Aug/2015:17:28:50 -0400] NSMMReplicationPlugin - 
> >agmt="cn=meTosrv-m14-30.cbls.ccr.buffalo.edu" (srv-m14-30:389): Unable to 
> >receive the response for a startReplication extended operation to consumer 
> >(Timed out). Will retry later.
> >
> >The access logs were filling up with:
> >
> >[31/Aug/2015:17:13:17 -0400] conn=1385990 fd=449 slot=449 connection from 
> >10.106.14.29 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385991 fd=450 slot=450 connection from 
> >10.104.9.137 to 10.113.14.30
> >[31/Aug/2015:17:13:18 -0400] conn=1385992 fd=451 slot=451 connection from 
> >10.104.16.19 to 10.113.14.30
> >[31/Aug/2015:17:13:21 -0400] conn=1385993 fd=452 slot=452 connection from 
> >10.111.11.30 to 10.113.14.30
> >[31/Aug/2015:17:13:24 -0400] conn=1385994 fd=453 slot=453 connection from 
> >10.113.27.115 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385995 fd=454 slot=454 connection from 
> >10.111.8.116 to 10.113.14.30
> >[31/Aug/2015:17:13:27 -0400] conn=1385996 fd=514 slot=514 connection from 
> >10.113.25.40 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385997 fd=515 slot=515 connection from 
> >10.106.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:29 -0400] conn=1385998 fd=516 slot=516 connection from 
> >10.111.10.141 to 10.113.14.30
> >[31/Aug/2015:17:13:30 -0400] conn=1385999 fd=528 slot=528 connection from 
> >10.104.14.27 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386000 fd=529 slot=529 connection from 
> >10.106.13.132 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386001 fd=530 slot=530 connection from 
> >10.113.25.11 to 10.113.14.30
> >[31/Aug/2015:17:13:31 -0400] conn=1386002 fd=531 slot=531 connection from 
> >10.104.15.11 to 10.113.14.30
> >[31/Aug

Re: [Freeipa-users] dnssec support in 4.1

2015-07-22 Thread Andrew E. Bruno
On Wed, Jul 22, 2015 at 04:48:33PM +0300, Alexander Bokovoy wrote:
 On Wed, 22 Jul 2015, Andrew E. Bruno wrote:
 Apologies if this has been answered before but we're interested in
 dnssec support in FreeIPA.  Running Centos 7.1.1503, ipa-server 4.1.0-18
 and following the docs here:
 https://www.freeipa.org/page/Howto/DNSSEC
 
 and
 
 http://www.freeipa.org/page/Releases/4.1.0#DNSSEC_Support
 
 # ipa-dns-install --dnssec-master
 Usage: ipa-dns-install [options]
 
 ipa-dns-install: error: no such option: --dnssec-master
 
 
 Is this not supported in 4.1.0?  If not, is there a manual way to get
 zone signing to work?
 DNSSEC support is switched off in RHEL 7.1 (and CentOS 7.1) but is
 available in Fedora 21+/upstream bits.
 
 We plan to bring DNSSEC support to next RHEL 7 update, thanks to
 stabilization work done after RHEL 7.1 release.

Sounds great. Thanks. Looking forward to the next update. 

 -- 
 / Alexander Bokovoy
 
 

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-17 Thread Andrew E. Bruno
On Thu, Jul 16, 2015 at 10:32:59AM +0200, Ludwig Krispenz wrote:
 Thank you for the data, I think I understand now what is going on.
 
 In the error logs we see only message like (from my test env):
 
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): replay_update: modifys operation (dn=dc=example,dc=com
 csn=55a82a2900010064) not sent - empty
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): replay_update: Consumer successfully sent operation with
 csn 55a82a2900010064
 [16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt=cn=100-300
 (localhost:9759): Skipping update operation with no message_id (uniqueid
 7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a2900010064):
 
 This happens if fractional replication is configured as IPA does and the
 changes affect only attributes which will NOT be replicated. So the local
 RUV will be updated, but since no change is really sent, the consumer RUV is
 not updated and replication will always set off from a very old starting
 csn. It is a rare scenario where a server receives only mods which are not
 replicated.
 
 I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225
 
 As a workaround can you try to apply a mod on m14-26 which will not be
 stripped, either create a dummy user or add a description attribute to an
 existing object. Replication will once again iterate through all the changes
 (which can take a while), but then should replay this latest change and
 define a new offset
 

Excellent. I can confirm your workaround fixed the issue. I updated a
user's email address (on m14-26) and the load came back down to normal
within a few minutes. 
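
(For the archives, the mod itself was nothing special; roughly this, with the
DN and address being placeholders:)

ldapmodify -x -D "cn=directory manager" -W
dn: uid=someuser,cn=users,cn=accounts,dc=ccr,dc=buffalo,dc=edu
changetype: modify
replace: mail
mail: someuser@example.edu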

Thanks very much for all your help debugging this.

Best,

--Andrew



 Regards,
 Ludwig
 
 

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Andrew E. Bruno
On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:
 
 On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
 On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
 On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, 
 and
 Jul,7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the 
 master
 ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv
 master is srv-m14-24.. here's the results of the ldapsearch:
 
 [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W 
  -b cn=config objectclass=nsds5replica nsds50ruv
 
 # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping 
 tree,cn=config
 nsds50ruv: {replicageneration} 5527f7110004
 nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
 5527f7710004 55a55aed0014
 nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
 5537c7730005 5591a3d200070005
 nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
 55943dda0006 5594537800020006
 so this is really strange, the master m14-24 has the latest change from
 replica 5(m14-26) as: 5591a3d200070005
 which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
 so no update since then has arrived at m14-24, or it could not update the ruv.
 So m14-26 tries to replicate all the changes back from that time, but looks
 like it has no success.
 is there anything in the logs of m14-24 ? can you see successful mods with
 csn=xxx0005 ?
 Here's what I could find from the logs on srv-m14-24:
 
 
 [srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
 access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 
 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054
 ok, so no update originating at replica 5 has been replicated (probably
 since June 29). Did you experience data inconsistency between the servers?
 
 
 And here's the last few lines of the error log on srv-m14-24:
 one set of messages refers to the o=ipaca backend and seems to be transient,
 replication continues later.
 the other set of messages, "No original tombstone ..", is annoying (and it is fixed
 in ticket https://fedorahosted.org/389/ticket/47912)
 
 the next thing we can do to try to understand what is going on is to enable
 replication logging on m14-26, it will then not only consume all cpu, but
 write tons of messages to the error log.
 But it can be turned on and off:
 
 ldapmodify ...
 dn: cn=config
 replace: nsslapd-errorlog-level
 nsslapd-errorlog-level: 8192
 
 and let it run for a while, then set it back to: 0

I enabled replication logging and it's running now. I noticed the
default value for nsslapd-errorlog-level was set to 16384 (not 0). 
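
For the archives, the exact toggle (sketch; remember to set it back to whatever
value you started with, 16384 in our case):

ldapmodify -x -D "cn=directory manager" -W
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

# ...and when done collecting:
ldapmodify -x -D "cn=directory manager" -W
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 16384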

OK to send you the logs off list? Looks like they contain quite a bit of
sensitive data.

Thanks again for all the help looking into this.

Best,

--Andrew



 
 
 [12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
 original_tombstone for changenumber=2456070,cn=changelog!!
 [12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
 original_tombstone for changenumber=2498441,cn=changelog!!
 [13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations 
 error (1). Will retry later.
 [13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
 agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
 (srv-m14-26:389): Consumer failed to replay change (uniqueid 
 cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-15 Thread Andrew E. Bruno
On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
 
 On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, and
 Jul,7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
 ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv
 
 master is srv-m14-24.. here's the results of the ldapsearch:
 
 [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W  
 -b cn=config objectclass=nsds5replica nsds50ruv
 
 # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping 
 tree,cn=config
 nsds50ruv: {replicageneration} 5527f7110004
 nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
 5527f7710004 55a55aed0014
 nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
 5537c7730005 5591a3d200070005
 nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
 55943dda0006 5594537800020006
 so this is really strange, the master m14-24 has the latest change from
 replica 5(m14-26) as: 5591a3d200070005
 which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
 so no update since then has arrived at m14-24, or it could not update the ruv.
 So m14-26 tries to replicate all the changes back from that time, but looks
 like it has no success.
 is there anything in the logs of m14-24 ? can you see successful mods with
 csn=xxx0005 ?

Here's what I could find from the logs on srv-m14-24:


[srv-m14-24 ~]# grep -r 0005 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT 
err=0 tag=103 nentries=0 etime=1 csn=55a4b5f00054


And here's the last few lines of the error log on srv-m14-24:

[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2498441,cn=changelog!!
[13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a4060060): Operations 
error (1). Will retry later.
[13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca0060): Operations 
error (1). Will retry later.
[13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f20060): Operations 
error (1). Will retry later.
[13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a411020060): Operations 
error (1). Will retry later.
[13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b320060): Operations 
error (1). Will retry later.
[13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a0060): Operations 
error (1). Will retry later.
[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad60060): Operations 
error (1). Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f60060): Operations 
error (1). Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - 
agmt=cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa0060): Operations 
error (1). Will retry later.
[14/Jul/2015:09:56:52 -0400

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-14 Thread Andrew E. Bruno
On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
 hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, and
 Jul,7th - so it looks like it is iterating the changelog over and over
 again.
 The consumer side is cn=meTosrv-m14-24.ccr.buffalo.edu - is this the master
 ?
 
 can you provide the result of the following search from
 m14-24.ccr.buffalo.edu and the server with the high cpu:
 
 ldapsearch -o ldif-wrap=no -x -D ... -w  -b cn=config
 objectclass=nsds5replica nsds50ruv


master is srv-m14-24.. here's the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W  -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a55aed0014
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 5591a3d200070005
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b


server with high cpu load is srv-m14-26. here's the results of the ldapsearch
from this server:

[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D cn=directory manager -W  -b 
cn=config objectclass=nsds5replica nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7730005 55a55b4700030005
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7710004 55a53eba0004
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943dda0006 5594537800020006

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 
5537c7ba005b 5582c7e40004005b
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 
5527f7540060 55a557f60060
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 
55943e6e0056 55943e6f00010056


srv-m14-25-02 is our 3rd replica which we recently added back in after it
failed (was added back in 7/1).

Let me know if you need anything else. Thanks for the help.

--Andrew

 
 On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
 On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
 On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
 On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
 can you get a pstack of the slapd process along with a top -H to find the
 thread with high cpu usage
 Attached is the full stacktrace of the running ns-slapd process. top -H
 shows this thread (2879) with high cpu usage:
 
 2879 dirsrv20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 
 ns-slapd
 this thread is a replication thread sending updates, what is strange is 
 that
 the current csn_str is quite old (july, 7th), I can't tell which 
  agreement
 this thread is handling, but looks like it is heavily reading the 
  changelog
 and sending updates. anything changed recently in replication setup ?
 Yes, we had one replica fail (on 6/19) which we removed (not this one
 showing high CPU load). Had to perform some manual cleanup of the ipa-ca
 RUVs. Then we added the replica back in on 7/1. Since then, replication
 appears to have been running normally between the 3 replicas. We've been
 monitoring utilization since 7/1 and only recently seen this spike (past
 24 hours or so).
 is it still in this state ? or was it a spike.
 Yes same state.
 
 if it still is high cpu consuming, could you
 - get a few pstack like the one before with some time in between, I would
 like to see if it is progressing with the csns or looping on the same one
 Attached are a few stacktraces. The thread pegging the cpu is:
 
 PID  USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+  COMMAND
 2879 dirsrv20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26  ns-slapd
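
(The traces were collected with something like this; pstack comes from the gdb
package, and /tmp is just where I stashed the output:)

for i in 1 2 3; do
    date
    top -b -H -n 1 -p $(pidof ns-slapd) | head -20
    pstack $(pidof ns-slapd) > /tmp/ns-slapd.pstack.$i
    sleep 60
done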
 
 - check the consumer side. is there anything in the error log ? does the
 access log show

[Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Andrew E. Bruno
We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16. 

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error. 
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error. 
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have much higher open connections (250) and low
cpu load avgs.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:

Start of Logs:13/Jul/2015:04:49:18
End of Logs:  13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts: 0
Total Connections:2343
 - LDAP Connections:  2120
 - LDAPI Connections: 223
 - LDAPS Connections: 0
 - StartTLS Extended Ops: 45
 Secure Protocol Versions:
  - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations: 111865
Total Results:111034
Overall Performance:  99.3%

Searches: 95585 (5.03/sec)  (301.64/min)
Modifications:3369  (0.18/sec)  (10.63/min)
Adds: 0 (0.00/sec)  (0.00/min)
Deletes:  0 (0.00/sec)  (0.00/min)
Mod RDNs: 0 (0.00/sec)  (0.00/min)
Compares: 0 (0.00/sec)  (0.00/min)
Binds:7082  (0.37/sec)  (22.35/min)

Proxied Auth Operations:  0
Persistent Searches:  0
Internal Operations:  0
Entry Operations: 0
Extended Operations:  5317
Abandoned Requests:   416
Smart Referrals Received: 0

VLV Operations:   96
VLV Unindexed Searches:   0
VLV Unindexed Components: 32
SORT Operations:  64

Entire Search Base Queries:   0
Paged Searches:   3882
Unindexed Searches:   0
Unindexed Components: 5

FDs Taken:2566
FDs Returned: 2643
Highest FD Taken: 249

Broken Pipes: 0
Connections Reset By Peer:0
Resource Unavailable: 0
Max BER Size Exceeded:0

Binds:7082
Unbinds:  2443
 - LDAP v2 Binds: 0
 - LDAP v3 Binds: 6859
 - AUTOBINDs: 223
 - SSL Client Binds:  0
 - Failed SSL Client Binds:   0
 - SASL Binds:6814
GSSAPI - 6591
EXTERNAL - 223
 - Directory Manager Binds:   0
 - Anonymous Binds:   6591
 - Other Binds:   491




strace timing on the ns-slapd process:


% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 94.400.346659597758   poll
  4.100.015057   15057 1   restart_syscall
  0.910.003353  575959 getpeername
  0.490.001796 15012   futex
  0.100.000364  73 5   read
-- --- --- - - 
100.000.367229   13559 total


top output (with threads 'H'):

  PID USER  PR  NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND  


 

Re: [Freeipa-users] ns-slapd high cpu usage

2015-07-13 Thread Andrew E. Bruno
On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
 
 On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
 On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
  can you get a pstack of the slapd process along with a top -H to find the
  thread with high cpu usage
 Attached is the full stacktrace of the running ns-slapd process. top -H
 shows this thread (2879) with high cpu usage:
 
 2879 dirsrv20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
 this thread is a replication thread sending updates, what is strange is that
 the current csn_str is quite old (july, 7th), I can't tell which agreement
 this thread is handling, but looks like it is heavily reading the changelog
 and sending updates. anything changed recently in replication setup ?


Yes, we had one replica fail (on 6/19) which we removed (not this one
showing high CPU load). Had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently seen this spike (past
24 hours or so). 

On a side note, we get hit with this bug often:

https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html

(rogue sssd_be process hammering a replica). 

This causes high ns-slapd utilization on the replica and restarting sssd
on the client host immediately fixes the issue. However, in this
case, we're not seeing this behavior.




 
 
 
 
 
 On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
 We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
 389-ds 1.3.3.1-16.
 
 Recently, the ns-slapd process on one of our replicas started showing 
 higher
 than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
 constantly.
 
 Seems very similar to this thread:
 https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html
 
 There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors 
 (not
 sure if these are related):
 
 
 [13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 [13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
 changenumber=2277387,cn=changelog from entryrdn index (-30993)
 [13/Jul/2015:04:11:50 -0400] - Operation error fetching 
 changenumber=2277387,cn=changelog (null), error -30993.
 [13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured 
 while adding change number 2277387, dn = 
 changenumber=2277387,cn=changelog: Operations error.
 [13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 [13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
 (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
 [13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
 changenumber=2281464,cn=changelog from entryrdn index (-30993)
 [13/Jul/2015:07:41:49 -0400] - Operation error fetching 
 changenumber=2281464,cn=changelog (null), error -30993.
 [13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured 
 while adding change number 2281464, dn = 
 changenumber=2281464,cn=changelog: Operations error.
 [13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation 
 failure [1]
 
 
 access logs seem to be showing normal activity. Here's the number of open
 connections:
 
 # ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc 
 -l
 62
 
 Note: the other two replicas have much higher open connections (250) and 
 low
 cpu load avgs.
 
 Here's some output of logconv.pl from our most recent access log on the 
 replica
 with high cpu load:
 
 Start of Logs:13/Jul/2015:04:49:18
 End of Logs:  13/Jul/2015:10:06:11
 
 Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds
 
 Restarts: 0
 Total Connections:2343
   - LDAP Connections:  2120
   - LDAPI Connections: 223
   - LDAPS Connections: 0
   - StartTLS Extended Ops: 45
   Secure Protocol Versions:
- TLS1.2 128-bit AES - 45
 
 Peak Concurrent Connections:  22
 Total Operations: 111865
 Total Results:111034
 Overall Performance:  99.3%
 
 Searches: 95585 (5.03/sec)  (301.64/min)
 Modifications:3369  (0.18/sec)  (10.63/min)
 Adds: 0 (0.00/sec)  (0.00/min)
 Deletes:  0 (0.00/sec)  (0.00/min)
 Mod RDNs: 0 (0.00/sec)  (0.00/min)
 Compares: 0 (0.00/sec)  (0.00/min)
 Binds:7082  (0.37/sec)  (22.35/min)
 
 Proxied Auth Operations:  0
 Persistent Searches:  0
 Internal Operations:  0
 Entry Operations: 0
 Extended Operations:  5317
 Abandoned Requests:   416
 Smart Referrals Received: 0
 
 VLV Operations:   96
 VLV Unindexed

Re: [Freeipa-users] dirsrv access logs flooded from single connection id

2015-07-02 Thread Andrew E. Bruno
On Thu, Jul 02, 2015 at 11:04:00AM +0200, thierry bordaz wrote:
 On 06/29/2015 06:34 PM, Andrew E. Bruno wrote:
 On Mon, Jun 29, 2015 at 10:29:24AM -0600, Rich Megginson wrote:
 On 06/29/2015 10:13 AM, Andrew E. Bruno wrote:
 Our dirsrv access logs on our freeipa master server are getting flooded
 with this:
 
 [29/Jun/2015:12:02:09 -0400] conn=215758 op=1355326784 SRCH
 base=cn=u2,cn=groups,cn=accounts,dc=ccr,dc=buffalo,dc=edu scope=0
 filter=(objectClass=*) attrs=objectClass posixgroup cn userPassword
 gidNumber member ipaNTSecurityIdentifier modifyTimestamp entryusn uid
 
 [29/Jun/2015:12:08:08 -0400] conn=215758 op=1356545457 RESULT err=0
 tag=101 nentries=0 etime=0 notes=P
 
 All from the same conn=215758. Logs get rotated every minute.
 
 logconv.pl is showing
 
 Searches: 265803(3322.54/sec) (199352.25/min)
 
 
 How can I figure out which ip address this query is coming from? Is
 there a way to fetch the ip using the connection id? conn=215758?
 grep "conn=215758 fd=" /var/log/dirsrv/slapd-INST/access*
 
 Unfortunately, if it has been rotated away, you won't be able to get the
 information from the access log.
 
 No luck .. looks like it has been rotated away. Any other thoughts?
 
 Is it correct to assume this is all coming from a single host? My
 thinking is that if I can kill the query coming from the host that it
 would solve the problem.
 
 Hi,
 
 This is looking like bug https://fedorahosted.org/389/ticket/48192.
 Because an LDAP client (likely SSSD?) keeps sending paged results requests
 although 0 entries are returned.
 A condition for this is that the search has been abandoned but it is
 difficult to verify this as the log file has rotated.
 
 This is fixed in 6.7 and 7.1.z
 

Thanks Thierry. We get bit by this sporadically every few days.
Load/network traffic from ns-slapd spikes up on one of our FreeIPA
servers. Access logs getting hammered from a single host. We then login
to the host in question and see the sssd_be process consuming higher
than avg cpu load (20%). netstat shows two connections to the FreeIPA
server:

tcp  0  456   sssd_be
tcp  00   sssd_be


Simply restarting sssd on the host always fixes the problem. We tried
backtracing the sssd_be process when it's in this state but didn't see
any obvious clues:

(gdb) bt
#0  0x0036b5ee8ed3 in __epoll_wait_nocancel () at 
../sysdeps/unix/syscall-template.S:82
#1  0x0036bae08d9c in epoll_event_loop (ev=value optimized out, 
location=value optimized out) at ../tevent_epoll.c:650
#2  epoll_event_loop_once (ev=value optimized out, location=value optimized 
out) at ../tevent_epoll.c:931
#3  0x0036bae072e6 in std_event_loop_once (ev=0xd86590, 
location=0x36be23d4f5 src/util/server.c:602) at ../tevent_standard.c:112
#4  0x0036bae0349d in _tevent_loop_once (ev=0xd86590, location=0x36be23d4f5 
src/util/server.c:602) at ../tevent.c:530
#5  0x0036bae0351b in tevent_common_loop_wait (ev=0xd86590, 
location=0x36be23d4f5 src/util/server.c:602) at ../tevent.c:634
#6  0x0036bae07256 in std_event_loop_wait (ev=0xd86590, 
location=0x36be23d4f5 src/util/server.c:602) at ../tevent_standard.c:138
#7  0x0036be22b963 in server_loop (main_ctx=0xd87900) at 
src/util/server.c:602
#8  0x0040a286 in main (argc=4, argv=value optimized out) at 
src/providers/data_provider_be.c:2856


Could this also be an issue with sssd? Is it normal for the sssd process
to keep sending paged results requests like this? So far this behavior
has only happened on centos 6.6 clients running sssd-1.11.6-30. 

Would setting a low connection timeout on the dirsrv help to alleviate
this problem until the patches from #48192 make it into mainstream
centos 7.1 updates?
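
(The knob I had in mind is nsslapd-idletimeout on cn=config; a sketch, with
3600 seconds being an arbitrary example value, and I'm not sure yet whether
it's advisable here:)

ldapmodify -x -D "cn=directory manager" -W
dn: cn=config
changetype: modify
replace: nsslapd-idletimeout
nsslapd-idletimeout: 3600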

Many thanks!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-07-01 Thread Andrew E. Bruno
On Thu, Jun 25, 2015 at 05:40:23PM -0400, Andrew E. Bruno wrote:
 On Mon, Jun 22, 2015 at 12:49:01PM -0400, Rob Crittenden wrote:
  
  You aren't seeing a replication agreement. You're seeing the Replication
  Update Vector (RUV).
  
  See 
  http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
  
  You need to do something like:
  
  # ldapmodify -D cn=directory manager -W -a
  dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
  objectclass: extensibleObject
  replica-base-dn: o=ipaca
  replica-id: 97
  cn: clean 97
  
  
  Great, thanks for the clarification.
  
  Curious what's the difference between running the ldapmodify above and
  ipa-replica-manage clean-ruv?
  
  
  Nothing, for the IPA data. This is a remnant from a CA replication
  agreement and it was an oversight not to add similar RUV management options
  to the ipa-csreplica-manage tool.
  
 
 I'm still seeing some inconsistencies. Forgive me if I'm mis-interpreting any
 of this output (still learning the ropes with FreeIPA here)..
 
 Just trying to wrap my head around the RUVs. Trying to follow the docs here:
 http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
 
 And after running the ldapsearch command to check for obsolete masters
 I'm not seeing the replica ID for the old replica we deleted (rep2):
 
 
 $ ldapsearch -xLLL -D "cn=directory manager" -W -s sub -b cn=config \
   '(objectclass=nsds5replica)'
 Enter LDAP Password: 
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
 cn: replica
 nsDS5Flags: 1
 objectClass: nsds5replica
 objectClass: top
 objectClass: extensibleobject
 nsDS5ReplicaType: 3
 nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
 nsds5ReplicaLegacyConsumer: off
 nsDS5ReplicaId: 4
 nsDS5ReplicaBindDN: cn=replication manager,cn=config
 nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
  LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
 nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
  LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
 nsState:: BABIa4xVJAABAA==
 nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
 nsds5ReplicaChangeCount: 1687559
 nsds5replicareapactive: 0
 
 dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
 objectClass: top
 objectClass: nsDS5Replica
 objectClass: extensibleobject
 nsDS5ReplicaRoot: o=ipaca
 nsDS5ReplicaType: 3
 nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep2
  falo.edu-pki-tomcat,ou=csusers,cn=config
 nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep3
  falo.edu-pki-tomcat,ou=csusers,cn=config
 cn: replica
 nsDS5ReplicaId: 96
 nsDS5Flags: 1
 nsState:: YAAPa4xVAAkACgABAA==
 nsDS5ReplicaName: c458be8e-df9c11e4-a351aa45-2e06257b
 nsds5ReplicaChangeCount: 9480
 nsds5replicareapactive: 0
 
 
 I see: 
 
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
 nsds5replicaid: 4
 
 and 
 
 dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
 nsDS5ReplicaId: 96
 
 
 In the above output I only see the old replica showing up under:
 
 nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA...
 
 According to the docs I need the nsds5replicaid for use in the CLEANALLRUV
 task? 
 
 I also checked the RUV tombstone entry as per the docs:
 
 # ldapsearch -xLLL -D "cn=directory manager" -W -b dc=ccr,dc=buffalo,dc=edu \
   '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'
 Enter LDAP Password: 
 dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
 cn: replica
 nsDS5Flags: 1
 objectClass: nsds5replica
 objectClass: top
 objectClass: extensibleobject
 nsDS5ReplicaType: 3
 nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
 nsds5ReplicaLegacyConsumer: off
 nsDS5ReplicaId: 4
 nsDS5ReplicaBindDN: cn=replication manager,cn=config
 nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
  LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
 nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
  LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
 nsState:: BADycYxVJAABAA==
 nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
 nsds50ruv: {replicageneration} 5527f7110004
 nsds50ruv: {replica 4 ldap://rep1:389} 5527f77100040
  000 558c722800020004
 nsds50ruv: {replica 5 ldap://rep3:389} 5537c77300050
  000 5582c7f600060005
 nsds5agmtmaxcsn: 
 dc=ccr,dc=buffalo,dc=edu;meTorep3;rep3;389;5;558c572b000a0004
 nsruvReplicaLastModified: {replica 4 ldap://rep1:389} 55
  8c7204
 nsruvReplicaLastModified: {replica 5 ldap://rep3:389} 00
  00
 nsds5ReplicaChangeCount: 1689129
 nsds5replicareapactive: 0
 
 And only see nsds50ruv attributes for rep1, and rep3. However, still seeing
 rep2 in the nsDS5ReplicaBindDN.
 
 If I'm parsing this output correctly, it appears the RUV for rep2 has already
 been cleaned? If so, how come the nsDS5ReplicaBindDN values still exist? 
 
 Also, why

[Freeipa-users] dirsrv access logs flooded from single connection id

2015-06-29 Thread Andrew E. Bruno
Our dirsrv access logs on our freeipa master server are getting flooded
with this:

[29/Jun/2015:12:02:09 -0400] conn=215758 op=1355326784 SRCH
base=cn=u2,cn=groups,cn=accounts,dc=ccr,dc=buffalo,dc=edu scope=0
filter=(objectClass=*) attrs=objectClass posixgroup cn userPassword
gidNumber member ipaNTSecurityIdentifier modifyTimestamp entryusn uid

[29/Jun/2015:12:08:08 -0400] conn=215758 op=1356545457 RESULT err=0
tag=101 nentries=0 etime=0 notes=P

All from the same conn=215758. Logs get rotated every minute. 

logconv.pl is showing

Searches: 265803(3322.54/sec) (199352.25/min)


How can I figure out which ip address this query is coming from? Is
there a way to fetch the ip using the connection id? conn=215758?

Thanks in advance.

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] dirsrv access logs flooded from single connection id

2015-06-29 Thread Andrew E. Bruno
On Mon, Jun 29, 2015 at 12:34:25PM -0400, Andrew E. Bruno wrote:
 On Mon, Jun 29, 2015 at 10:29:24AM -0600, Rich Megginson wrote:
  On 06/29/2015 10:13 AM, Andrew E. Bruno wrote:
  Our dirsrv access logs on our freeipa master server are getting flooded
  with this:
  
  [29/Jun/2015:12:02:09 -0400] conn=215758 op=1355326784 SRCH
  base=cn=u2,cn=groups,cn=accounts,dc=ccr,dc=buffalo,dc=edu scope=0
  filter=(objectClass=*) attrs=objectClass posixgroup cn userPassword
  gidNumber member ipaNTSecurityIdentifier modifyTimestamp entryusn uid
  
  [29/Jun/2015:12:08:08 -0400] conn=215758 op=1356545457 RESULT err=0
  tag=101 nentries=0 etime=0 notes=P
  
  All from the same conn=215758. Logs get rotated every minute.
  
  logconv.pl is showing
  
  Searches: 265803(3322.54/sec) (199352.25/min)
  
  
  How can I figure out which ip address this query is coming from? Is
  there a way to fetch the ip using the connection id? conn=215758?
  
  grep "conn=215758 fd=" /var/log/dirsrv/slapd-INST/access*
  
  Unfortunately, if it has been rotated away, you won't be able to get the
  information from the access log.
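
(For the archives: when the open-connection record is still in the rotated
access logs, the line being grepped for looks roughly like the one below,
and zgrep covers compressed rotations; the timestamp and addresses here are
placeholders, not taken from this incident:)

 $ zgrep "conn=215758 fd=" /var/log/dirsrv/slapd-INST/access*
 [29/Jun/2015:08:00:00 -0400] conn=215758 fd=64 slot=64 connection from 10.0.0.50 to 10.0.0.1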
  
 
 No luck .. looks like it has been rotated away. Any other thoughts?
 
 Is it correct to assume this is all coming from a single host? My
 thinking is that if I can kill the query coming from that host, it
 would solve the problem. 

Found the host using tcpdump. 
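
(For anyone hitting the same thing: a capture along these lines on the IPA
server is usually enough to see which client is hammering the LDAP ports;
the interface name and packet count are only examples:)

# tcpdump -nn -i eth0 -c 200 'tcp port 389 or tcp port 636'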

Thanks again for the help,

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] dirsrv access logs flooded from single connection id

2015-06-29 Thread Andrew E. Bruno
On Mon, Jun 29, 2015 at 10:29:24AM -0600, Rich Megginson wrote:
 On 06/29/2015 10:13 AM, Andrew E. Bruno wrote:
 Our dirsrv access logs on our freeipa master server are getting flooded
 with this:
 
 [29/Jun/2015:12:02:09 -0400] conn=215758 op=1355326784 SRCH
 base=cn=u2,cn=groups,cn=accounts,dc=ccr,dc=buffalo,dc=edu scope=0
 filter=(objectClass=*) attrs=objectClass posixgroup cn userPassword
 gidNumber member ipaNTSecurityIdentifier modifyTimestamp entryusn uid
 
 [29/Jun/2015:12:08:08 -0400] conn=215758 op=1356545457 RESULT err=0
 tag=101 nentries=0 etime=0 notes=P
 
 All from the same conn=215758. Logs get rotated every minute.
 
 logconv.pl is showing
 
 Searches: 265803(3322.54/sec) (199352.25/min)
 
 
 How can I figure out which ip address this query is coming from? Is
 there a way to fetch the ip using the connection id? conn=215758?
 
 grep "conn=215758 fd=" /var/log/dirsrv/slapd-INST/access*
 
 Unfortunately, if it has been rotated away, you won't be able to get the
 information from the access log.
 

No luck .. looks like it has been rotated away. Any other thoughts?

Is it correct to assume this is all coming from a single host? My
thinking is that if I can kill the query coming from that host, it
would solve the problem. 

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-25 Thread Andrew E. Bruno
On Mon, Jun 22, 2015 at 12:49:01PM -0400, Rob Crittenden wrote:
 
 You aren't seeing a replication agreement. You're seeing the Replication
 Update Vector (RUV).
 
 See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
 
 You need to do something like:
 
 # ldapmodify -D "cn=directory manager" -W -a
 dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
 objectclass: extensibleObject
 replica-base-dn: o=ipaca
 replica-id: 97
 cn: clean 97
 
 
 Great, thanks for the clarification.
 
 Curious what's the difference between running the ldapmodify above and
 ipa-replica-manage clean-ruv?
 
 
  Nothing, for the IPA data. This is a remnant from a CA replication
  agreement and it was an oversight not to add similar RUV management options
  to the ipa-csreplica-manage tool.
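
(Side note for the archives: once the cleanAllRUV task entry has been added,
its progress can be followed by reading the task entry back and by watching
the 389-ds errors log; roughly like this, with the instance name as a
placeholder:)

$ ldapsearch -xLLL -D "cn=directory manager" -W \
    -b "cn=clean 97,cn=cleanallruv,cn=tasks,cn=config"
$ grep -i cleanallruv /var/log/dirsrv/slapd-INST/errors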
 

I'm still seeing some inconsistencies. Forgive me if I'm misinterpreting any
of this output (still learning the ropes with FreeIPA here)...

Just trying to wrap my head around the RUVs. Trying to follow the docs here:
http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html

And after running the ldapsearch command to check for obsolete masters
I'm not seeing the replica ID for the old replica we deleted (rep2):


$ ldapsearch -xLLL -D "cn=directory manager" -W -s sub -b cn=config \
  '(objectclass=nsds5replica)'
Enter LDAP Password: 
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 4
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsState:: BABIa4xVJAABAA==
nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
nsds5ReplicaChangeCount: 1687559
nsds5replicareapactive: 0

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
objectClass: extensibleobject
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaType: 3
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep2
 falo.edu-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep3
 falo.edu-pki-tomcat,ou=csusers,cn=config
cn: replica
nsDS5ReplicaId: 96
nsDS5Flags: 1
nsState:: YAAPa4xVAAkACgABAA==
nsDS5ReplicaName: c458be8e-df9c11e4-a351aa45-2e06257b
nsds5ReplicaChangeCount: 9480
nsds5replicareapactive: 0


I see: 

dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds5replicaid: 4

and 

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsDS5ReplicaId: 96


In the above output I only see the old replica showing up under:

nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA...

According to the docs I need the nsds5replicaid for use in the CLEANALLRUV
task? 

I also checked the RUV tombstone entry as per the docs:

# ldapsearch -xLLL -D "cn=directory manager" -W -b dc=ccr,dc=buffalo,dc=edu \
  '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'
Enter LDAP Password: 
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 4
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsState:: BADycYxVJAABAA==
nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://rep1:389} 5527f77100040
 000 558c722800020004
nsds50ruv: {replica 5 ldap://rep3:389} 5537c77300050
 000 5582c7f600060005
nsds5agmtmaxcsn: 
dc=ccr,dc=buffalo,dc=edu;meTorep3;rep3;389;5;558c572b000a0004
nsruvReplicaLastModified: {replica 4 ldap://rep1:389} 55
 8c7204
nsruvReplicaLastModified: {replica 5 ldap://rep3:389} 00
 00
nsds5ReplicaChangeCount: 1689129
nsds5replicareapactive: 0

And only see nsds50ruv attributes for rep1, and rep3. However, still seeing
rep2 in the nsDS5ReplicaBindDN.

If I'm parsing this output correctly, it appears the RUV for rep2 has already
been cleaned? If so, how come the nsDS5ReplicaBindDN values still exist? 

Also, why is there a nsds50ruv attribute for rep2 listed when I run this query
(but not the others above):


$ ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=mapping tree,cn=config" \
  '(objectClass=nsDS5ReplicationAgreement)'

dn: 

Re: [Freeipa-users] ipa replica failure

2015-06-22 Thread Andrew E. Bruno
On Mon, Jun 22, 2015 at 10:02:59AM -0400, Rob Crittenden wrote:
 Andrew E. Bruno wrote:
 On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:
 Rich Megginson wrote:
 On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
 
 Questions:
 
 0. Is it likely that after running out of file descriptors the dirsrv
 slapd database on rep2 was corrupted?
 
 That would appear to be the case based on correlation of events,
 although I've never seen that happen, and it is not supposed to happen.
 
 
 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
 remaining replica servers (rep1 and rep3)? Or should it just be run on
 the first master?
 
 I believe it should only be run on the first master, but it hung, so
 something is not right, and I'm not sure how to remedy the situation.
 
 How long did it hang, and where?
 
 This command was run on rep1 (first master):
 
 [rep1]$ ipa-replica-manage del rep2
 
  This command hung (~10 minutes) until I hit Ctrl-C. After noticing LDAP
  queries were hanging on rep2 we ran this on rep2:
 
 [rep2]$ systemctl stop ipa
 (shutdown all ipa services on rep2)
 
 Then back on rep1 (first master)
 
 [rep1]$ ipa-replica-manage -v --force del rep2
 
 Which appeared to work ok.
 
 
  Do we need to run ipa-csreplica-manage del as well?
 
 2. Why does the rep2 server still appear when querying the
 nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
 problems
 when we go to add rep2 back in?
 
 You should remove it.
 
 And ipa-csreplica-manage is the tool to do it.
 
 When I run this on rep1 (first master):
 
 [rep1]$ ipa-csreplica-manage list
 Directory Manager password:
 
 rep3: master
 rep1: master
 
 
 [rep1]$ ipa-csreplica-manage del rep2
 Directory Manager password:
 
 'rep1' has no replication agreement for 'rep2'
 
 But seems to still be there:
 
  [rep1]$ ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
    '(objectClass=nsDS5ReplicationAgreement)' -LL
 
 dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
 tree,cn=config
 objectClass: top
 objectClass: nsds5replicationagreement
 cn: masterAgreement1-rep3-pki-tomcat
 nsDS5ReplicaRoot: o=ipaca
 nsDS5ReplicaHost: rep3
 nsDS5ReplicaPort: 389
 nsDS5ReplicaBindDN: cn=Replication Manager 
 cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
 nsDS5ReplicaBindMethod: Simple
 nsDS5ReplicaTransportInfo: TLS
 description: masterAgreement1-rep3-pki-tomcat
 nsds50ruv: {replicageneration} 5527f74b0060
 nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
    5582c7e40004005b
 nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
    5582cd190060
 nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
    556f462b00040061
 nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
   000
 nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
   000
 nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
   000
 nsds5replicaLastUpdateStart: 20150619193149Z
 nsds5replicaLastUpdateEnd: 20150619193149Z
 nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
 nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental 
 upd
   ate succeeded
 nsds5replicaUpdateInProgress: FALSE
 nsds5replicaLastInitStart: 0
 nsds5replicaLastInitEnd: 0
 
 
 However, when I run the ldapsearch on rep3 it's not there (the
 cn=ipaca,cn=mapping tree,cn=config is not listed):
 
  [rep3]$ ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
    '(objectClass=nsDS5ReplicationAgreement)' -LL
 
 dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
 tree,cn=config
 cn: meTorep1
 objectClass: nsds5replicationagreement
 objectClass: top
 nsDS5ReplicaTransportInfo: LDAP
 description: me to rep1
 nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
 nsDS5ReplicaHost: rep1
 
 
 
 
 3. What steps/commands can we take to verify rep2 was successfully
 removed and
 replication is behaving normally?
 
 The ldapsearch you performed already will confirm that the CA agreement has
 been removed.
 
 Still showing up.. Any thoughts?
 
 At this point we want to ensure both remaining masters are functional and
 operating normally. Any other commands you recommend running to check?
 
 You aren't seeing a replication agreement. You're seeing the Replication
 Update Vector (RUV).
 
 See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
 
 You need to do something like:
 
 # ldapmodify -D "cn=directory manager" -W -a
 dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
 objectclass: extensibleObject
 replica-base-dn: o=ipaca
 replica-id: 97
 cn: clean 97
 

Great, thanks for the clarification. 

Curious what's the difference between running the ldapmodify above and
ipa-replica-manage clean-ruv? 
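
(For completeness: as noted in the replies above, ipa-replica-manage only
covers the main IPA suffix, which is why the raw ldapmodify is needed for
o=ipaca. For the IPA suffix the workflow is roughly the two commands below,
with the numeric id taken from the list-ruv output; treat the syntax as a
sketch:)

$ ipa-replica-manage list-ruv
$ ipa-replica-manage clean-ruv <replica-id>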



--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Andrew E. Bruno
On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:
 Rich Megginson wrote:
 On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
 
 Questions:
 
 0. Is it likely that after running out of file descriptors the dirsrv
 slapd database on rep2 was corrupted?
 
 That would appear to be the case based on correlation of events,
 although I've never seen that happen, and it is not supposed to happen.
 
 
 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
 remaining replica servers (rep1 and rep3)? Or should it just be run on
 the first master?
 
 I believe it should only be run on the first master, but it hung, so
 something is not right, and I'm not sure how to remedy the situation.
 
 How long did it hang, and where?

This command was run on rep1 (first master):

[rep1]$ ipa-replica-manage del rep2 

This command hung (~10 minutes) until I hit Ctrl-C. After noticing LDAP
queries were hanging on rep2 we ran this on rep2:

[rep2]$ systemctl stop ipa
(shutdown all ipa services on rep2)

Then back on rep1 (first master)

[rep1]$ ipa-replica-manage -v --force del rep2

Which appeared to work ok.

 
 Do we need to run ipa-csreplica-manage del as well?
 
 2. Why does the rep2 server still appear when querying the
 nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
 problems
 when we go to add rep2 back in?
 
 You should remove it.
 
 And ipa-csreplica-manage is the tool to do it.

When I run this on rep1 (first master):

[rep1]$ ipa-csreplica-manage list
Directory Manager password: 

rep3: master
rep1: master


[rep1]$ ipa-csreplica-manage del rep2
Directory Manager password: 

'rep1' has no replication agreement for 'rep2'

But seems to still be there:

[rep1]$ ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
  '(objectClass=nsDS5ReplicationAgreement)' -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager 
cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
  5582c7e40004005b
nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
  5582cd190060
nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
  556f462b00040061
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
 000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
 000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
 000
nsds5replicaLastUpdateStart: 20150619193149Z
nsds5replicaLastUpdateEnd: 20150619193149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
 ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0


However, when I run the ldapsearch on rep3 it's not there (the
cn=ipaca,cn=mapping tree,cn=config is not listed):

[rep3]$ ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
  '(objectClass=nsDS5ReplicationAgreement)' -LL

dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
tree,cn=config
cn: meTorep1
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: LDAP
description: me to rep1
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaHost: rep1


 
 
 3. What steps/commands can we take to verify rep2 was successfully
 removed and
 replication is behaving normally?
 
 The ldapsearch you performed already will confirm that the CA agreement has
 been removed.

Still showing up.. Any thoughts? 

At this point we want to ensure both remaining masters are functional and
operating normally. Any other commands you recommend running to check? 
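
(For later readers: a couple of low-impact checks commonly suggested for
this; the attribute names come from the agreement entries shown earlier in
the thread, the rest is only a sketch:)

$ ipa-replica-manage list
$ ipa-replica-manage -v list rep1
$ ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=mapping tree,cn=config" \
    '(objectClass=nsds5replicationagreement)' \
    nsds5replicaLastUpdateStatus nsds5replicaLastUpdateEnd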

 
 8192 is extremely high.  The fact that you ran out of file descriptors
 at 8192 seems like a bug/fd leak somewhere.  I suppose you could, as a
 very temporary workaround, set the fd limit higher, but that is no
 guarantee that you won't run out again.
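 
 (If anyone does need that temporary workaround: the server-side knob is
 nsslapd-maxdescriptors in cn=config, and on EL7 the dirsrv service's
 LimitNOFILE has to allow at least that many descriptors as well; the value
 below is only an example:)
 
 $ ldapmodify -D "cn=directory manager" -W <<EOF
 dn: cn=config
 changetype: modify
 replace: nsslapd-maxdescriptors
 nsslapd-maxdescriptors: 16384
 EOF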
 
 Please file at least 1 ticket e.g. database corrupted when server ran
 out of file descriptors, with as much information about that particular
 problem as you can provide.
 

Will do.

Thanks very much for all the help!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Andrew E. Bruno
On Fri, Jun 19, 2015 at 09:08:15PM -0700, Janelle wrote:
 On 6/19/15 11:22 AM, Andrew E. Bruno wrote:
 Hello,
 
  First time troubleshooting an IPA server failure and looking for some
  guidance on how best to proceed.
 
 First some background on our setup:
 
 Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
 
 - ipa-server-4.1.0-18.el7.centos.3.x86_64
 - 389-ds-base-1.3.3.1-16.el7_1.x86_64
 
  3 ipa-servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
  replicas were set up to be CAs (i.e. ipa-replica-install --setup-ca...)
 
  We have ~3000 user accounts (~1000 active, the rest disabled). We have
  ~700 hosts enrolled (all installed using ipa-client-install and running
  sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.
 
 
 We recently discovered one of our replica servers (rep2) was not
 responding. A quick check of the dirsrv logs
 /var/log/dirsrv/slapd-/errors (sanitized):
 
  PR_Accept() failed, Netscape Portable Runtime error (Process open
  FD table is full.)
  ...
 
 The server was rebooted and after coming back up had these errors in the 
 logs:
 
  389-Directory/1.3.3.1 B2015.118.1941
  replica2:636 (/etc/dirsrv/slapd-)
 
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 
 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
 (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
 database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
 (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
 database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
 database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
 recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
 database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
 recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - 
 BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
 
 [16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting 
 up
 [16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time 
 Directory Server was running, recovering database.
 ...
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 replica_check_for_data_reload: Warning: disordely shutdown for replica 
 dc=XXX. Check if DB RUV needs to be updated
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  5577006800030003
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f463200140004
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f4631004d0005
 [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send 
 startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection 
 refused)
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 agmt=cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth 
 failed: LDAP error -1 (Can't contact LDAP server) ()
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 replica_check_for_data_reload: Warning: disordely shutdown for replica 
 o=ipaca. Check if DB RUV needs to be updated
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f46290005005b
 [16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial 
 credentials for principal [ldap/rep2] in keytab 
 [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
 requested realm)
 [16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
 not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
 contact LDAP server) ((null)) errno 111 (Connection refused)
 [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform 
 interactive bind for id [] authentication mechanism [GSSAPI]: error -1 
 (Can't contact LDAP server)
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt=cn=meTorep1 
 (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't 
 contact LDAP server) ()
 [16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password 
 Policy,cn

[Freeipa-users] ipa replica failure

2015-06-19 Thread Andrew E. Bruno
Hello,

First time troubleshooting an IPA server failure and looking for some
guidance on how best to proceed.

First some background on our setup:

Servers are running freeipa v4.1.0 on CentOS 7.1.1503:

- ipa-server-4.1.0-18.el7.centos.3.x86_64
- 389-ds-base-1.3.3.1-16.el7_1.x86_64

3 ipa-servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
replicas were set up to be CAs (i.e. ipa-replica-install --setup-ca...)

We have ~3000 user accounts (~1000 active, the rest disabled). We have
~700 hosts enrolled (all installed using ipa-client-install and running
sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.


We recently discovered one of our replica servers (rep2) was not
responding. A quick check of the dirsrv logs
/var/log/dirsrv/slapd-/errors (sanitized):

PR_Accept() failed, Netscape Portable Runtime error (Process open
FD table is full.)
...
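
(For anyone wanting to watch for this condition before it bites: the live
connection/descriptor numbers are exposed under cn=monitor, and the kernel's
count of open descriptors can be read from /proc; a rough sketch:)

$ ldapsearch -xLLL -D "cn=directory manager" -W -b cn=monitor -s base \
    '(objectclass=*)' currentconnections totalconnections dtablesize
$ ls /proc/$(pidof ns-slapd)/fd | wc -l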

The server was rebooted and after coming back up had these errors in the logs:

389-Directory/1.3.3.1 B2015.118.1941
replica2:636 (/etc/dirsrv/slapd-)

[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 
(BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, 
err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, 
err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - 
BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)

[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory 
Server was running, recovering database.
...
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. 
Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database 
RUV (from CL RUV) -  5577006800030003
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database 
RUV (from CL RUV) -  556f463200140004
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database 
RUV (from CL RUV) -  556f4631004d0005
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS 
request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
agmt=cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth 
failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. 
Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database 
RUV (from CL RUV) -  556f46290005005b
[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials 
for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 
(Cannot contact any KDC for requested realm)
[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
contact LDAP server) ((null)) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform 
interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't 
contact LDAP server)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt=cn=meTorep1 
(rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't 
contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password 
Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before 
the CoS Definition.
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not 
delete change record 301438 (rc: 32)