Hi Thierry, Rob, Flo,

Unfortunately I no longer have the failure logs (they were lost after a
couple of reinstallations). Anyway, I'll try to reconstruct some
information to help you investigate further. The behaviour was:

1. The IPA replication started and quickly reached "[28/41]: setting up
initial replication".

2. Near the end of replication, after about 20 seconds, the process
aborted with the message:
[ldap://idc02.my.dom.ain:389] reports: Update failed! Status: [Error
(-11) connection error: Unknown connection error (-11) - Total update
aborted]

idc02 is the working IPA/389-ds server.

On idc01 (the would-be replica) I found, in the dirsrv errors log:

(idc01:389): Received error -1 (Can't contact LDAP server):  for total
update operation

and somewhere else in the same file on idc01 a message similar to:

SASL encrypted packet length exceeds maximum allowed limit

3. At the time of the crash I noticed (via a tcpdump session) some "TCP
zero window" packets in the capture, sent by idc01 to idc02.

4. After that, the 389-ds server on idc01 was up, but many other IPA
components were not (that's why I say the IPA replica setup crashed; no
rollback was attempted). The working server was also up, but somehow
"dirty", with some replica update vectors (RUVs) still pointing to idc01.

5. The solution was to pass "--dirsrv-config-file=custom.ldif" to
ipa-replica-install, with custom.ldif containing:

dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 4194304
-
replace: nsslapd-sasl-max-buffer-size
nsslapd-sasl-max-buffer-size: 4194304

(original value was 2097152 for both configuration variables).
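
For anyone who wants to try other sizes, here is a minimal Python sketch
that generates such a file. The helper name build_sasl_ldif is
hypothetical (it is not part of IPA or 389-ds); the attribute names and
the 2097152 starting value are the ones reported above. It writes a "-"
line between the two modifications, which is the separator RFC 2849
expects inside a single modify record.

```python
# Hypothetical helper, not part of FreeIPA or 389-ds: it only builds LDIF text.
SASL_ATTRS = ("nsslapd-maxsasliosize", "nsslapd-sasl-max-buffer-size")


def build_sasl_ldif(size_bytes):
    """Return LDIF text that sets both SASL size attributes on cn=config."""
    lines = ["dn: cn=config", "changetype: modify"]
    for index, attr in enumerate(SASL_ATTRS):
        if index > 0:
            # RFC 2849 separates modifications in one record with a "-" line.
            lines.append("-")
        lines.append("replace: " + attr)
        lines.append("%s: %d" % (attr, size_bytes))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = 2097152  # 2 MiB, the value reported as the original one
    # Double the original value, as was done in this thread (4 MiB).
    print(build_sasl_ldif(original * 2), end="")
```

The output can be redirected to custom.ldif and passed to
ipa-replica-install with --dirsrv-config-file, as described in point 5.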

This makes me think that "TCP zero window" was only a consequence, not a
cause. After this tweak everything worked like a charm.

A couple of considerations:

1. I think you can reproduce the faulty behaviour by doing the exact
opposite of what I did, i.e. decreasing those two values. I don't know
exactly by how much.

2. Maybe ipa-replica-install should try to catch this situation, output
something more explanatory, and possibly try to roll back.
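
To illustrate the second point, here is a minimal Python sketch of what
"catching this situation" could look like from the outside: it scans a
dirsrv errors log for the messages mentioned in this thread and prints a
hint. This is not how ipa-replica-install works; the function name
diagnose and the hint wording are assumptions made for illustration, and
only the two log signatures come from the thread.

```python
import sys

# Log signatures quoted in this thread, each paired with a hint.
# The hint texts are illustrative wording, not output of any IPA tool.
KNOWN_SIGNATURES = [
    ("SASL encrypted packet length exceeds maximum allowed limit",
     "try raising nsslapd-maxsasliosize and nsslapd-sasl-max-buffer-size"),
    ("different database generation ID",
     "the total update did not complete; the replica must be reinitialized"),
]


def diagnose(log_lines):
    """Return (signature, hint) pairs found in log_lines, in first-seen order."""
    found = []
    for line in log_lines:
        for signature, hint in KNOWN_SIGNATURES:
            if signature in line and (signature, hint) not in found:
                found.append((signature, hint))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path of an errors log, e.g. the dirsrv errors file.
    with open(sys.argv[1]) as handle:
        for signature, hint in diagnose(handle):
            print("%s -> %s" % (signature, hint))
```

It matches on plain substrings, so it only works on the real log file
(one message per line), not on the wrapped copies pasted into mail.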


I'm sorry I have no real logs to post, but I hope this helps anyway.

Thank you and regards,
Giulio




On 10/04/2019 17:44, thierry bordaz wrote:
> 
> 
> On 4/10/19 4:59 PM, Rob Crittenden wrote:
>> Giulio Casella via FreeIPA-users wrote:
>>> Hi,
>>> I managed to fix it!
>>> The solution was to increase a couple of parameters in ldap config. I
>>> passed "--dirsrv-config-file=custom.ldif" to ipa-replica-install, with
>>> custom.ldif containing:
>>>
>>> dn: cn=config
>>> changetype: modify
>>> replace: nsslapd-maxsasliosize
>>> nsslapd-maxsasliosize: 4194304
>>> replace: nsslapd-sasl-max-buffer-size
>>> nsslapd-sasl-max-buffer-size: 4194304
>>>
>>> In brief, I doubled the SASL buffer size, because I noticed a log
>>> message saying "SASL encrypted packet length exceeds maximum
>>> allowed limit".
>>>
>>> But the behaviour of ipa-replica-install was quite strange: it crashed,
>>> and in a packet capture session I noticed some "TCP zero window"
>>> packets sent from the would-be replica to the existing IPA server.
>>> Maybe the developers want to try to catch that error and revert the
>>> operation, just as is done with other kinds of errors.
>> Maybe one of the 389-ds devs has an idea. They're probably going to
>> want to see logs and what your definition of crash is.
>>
>> rob
> TCP zero window makes me think of a client not reading fast enough.
> Is it transient/recoverable or not?
> 
> Rob is right: if a problem is detected at the 389-ds level, access/errors
> logs are appreciated,
> and also the ipa-replica-install backtrace from when it crashed.
> 
> regards
> thierry
>>
>>> Ciao,
>>> g
>>>
>>>
>>> On 01/04/2019 15:28, Giulio Casella via FreeIPA-users wrote:
>>>> Hi,
>>>> I'm still stuck on this, I tried to delete every reference to the old
>>>> server, with ipa commands ("ipa-replica-manage clean-ruv") and directly
>>>> in ldap (as reported in https://access.redhat.com/solutions/136993).
>>>>
>>>> If I try to "ipa-replica-manage list-ruv" on idc02 I get:
>>>>
>>>> Replica Update Vectors:
>>>>          idc02.my.dom.ain:389: 5
>>>> Certificate Server Replica Update Vectors:
>>>>          idc02.my.dom.ain:389: 91
>>>>
>>>> (same result looking directly into ldap)
>>>>
>>>> Is this correct? Does a server have a replica reference to itself?
>>>>
>>>> I also tried to instantiate a new server, idc03.my.dom.ain, never known
>>>> before (fresh centos install, ipa-client-install, ipa-replica-install).
>>>> The setup (surprisingly to me) failed (details below).
>>>>
>>>> At this point I suspect the problem is on idc02 (the only working
>>>> server), unrelated to previous server idc01.
>>>>
>>>> For completeness this is what I did:
>>>>
>>>> . Fresh install of a CentOS 7 box, updated, installed ipa software
>>>> (name idc03.my.dom.ain)
>>>> . ipa-client-install --principal admin --domain=my.dom.ain
>>>> --realm=MY.DOM.AIN --force-join
>>>> . ipa-replica-install --setup-dns --no-forwarders --setup-ca
>>>>
>>>> Last command failed (in "[28/41]: setting up initial replication"), and
>>>> in /var/log/ipareplica-install.log of idc03 I read:
>>>>
>>>> [...]
>>>> 2019-03-28T09:30:48Z DEBUG   [28/41]: setting up initial replication
>>>> 2019-03-28T09:30:48Z DEBUG retrieving schema for SchemaCache
>>>> url=ldapi://%2fvar%2frun%2fslapd-MY-DOM-AIN.socket
>>>> conn=<ldap.ldapobject.SimpleLDAPObject instance at 0x7fb72af73050>
>>>> 2019-03-28T09:30:48Z DEBUG Destroyed connection
>>>> context.ldap2_140424739228880
>>>> 2019-03-28T09:30:48Z DEBUG Starting external process
>>>> 2019-03-28T09:30:48Z DEBUG args=/bin/systemctl --system daemon-reload
>>>> 2019-03-28T09:30:48Z DEBUG Process finished, return code=0
>>>> 2019-03-28T09:30:48Z DEBUG stdout=
>>>> 2019-03-28T09:30:48Z DEBUG stderr=
>>>> 2019-03-28T09:30:48Z DEBUG Starting external process
>>>> 2019-03-28T09:30:48Z DEBUG args=/bin/systemctl restart
>>>> dirsrv@MY-DOM-AIN.service
>>>> 2019-03-28T09:30:54Z DEBUG Process finished, return code=0
>>>> 2019-03-28T09:30:54Z DEBUG stdout=
>>>> 2019-03-28T09:30:54Z DEBUG stderr=
>>>> 2019-03-28T09:30:54Z DEBUG Restart of dirsrv@MY-DOM-AIN.service
>>>> complete
>>>> 2019-03-28T09:30:54Z DEBUG Created connection
>>>> context.ldap2_140424739228880
>>>> 2019-03-28T09:30:55Z DEBUG Fetching nsDS5ReplicaId from master
>>>> [attempt 1/5]
>>>> 2019-03-28T09:30:55Z DEBUG retrieving schema for SchemaCache
>>>> url=ldap://idc02.my.dom.ain:389 conn=<ldap.ldapobject.SimpleLDAPObject
>>>> instance at 0x7fb72bf8e128>
>>>> 2019-03-28T09:30:55Z DEBUG Successfully updated nsDS5ReplicaId.
>>>> 2019-03-28T09:30:55Z DEBUG Add or update replica config
>>>> cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping tree,cn=config
>>>> 2019-03-28T09:30:55Z DEBUG Added replica config
>>>> cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping tree,cn=config
>>>> 2019-03-28T09:30:55Z DEBUG Add or update replica config
>>>> cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping tree,cn=config
>>>> 2019-03-28T09:30:55Z DEBUG No update to
>>>> cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping tree,cn=config
>>>> necessary
>>>> 2019-03-28T09:30:55Z DEBUG Waiting for replication
>>>> (ldap://idc02.my.dom.ain:389)
>>>> cn=meToidc03.my.dom.ain,cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping
>>>> tree,cn=config
>>>> (objectclass=*)
>>>> 2019-03-28T09:30:55Z DEBUG Entry found
>>>> [LDAPEntry(ipapython.dn.DN('cn=meToidc03.my.dom.ain,cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping
>>>> tree,cn=config'), {u'nsds5replicaLastInitStart': ['19700101000000Z'],
>>>> u'nsds5replicaUpdateInProgress': ['FALSE'], u'cn':
>>>> ['meToidc03.my.dom.ain'], u'objectClass': ['nsds5replicationagreement',
>>>> 'top'], u'nsds5replicaLastUpdateEnd': ['19700101000000Z'],
>>>> u'nsDS5ReplicaRoot': ['dc=my,dc=dom,dc=ain'], u'nsDS5ReplicaHost':
>>>> ['idc03.my.dom.ain'], u'nsds5replicaLastUpdateStatus': ['Error (0) No
>>>> replication sessions started since server startup'],
>>>> u'nsDS5ReplicaBindMethod': ['SASL/GSSAPI'], u'nsds5ReplicaStripAttrs':
>>>> ['modifiersName modifyTimestamp internalModifiersName
>>>> internalModifyTimestamp'], u'nsds5replicaLastUpdateStart':
>>>> ['19700101000000Z'], u'nsDS5ReplicaPort': ['389'],
>>>> u'nsDS5ReplicaTransportInfo': ['LDAP'], u'description': ['me to
>>>> idc03.my.dom.ain'], u'nsds5replicareapactive': ['0'],
>>>> u'nsds5replicaChangesSentSinceStartup': [''], u'nsds5replicaTimeout':
>>>> ['120'], u'nsDS5ReplicatedAttributeList': ['(objectclass=*) $ EXCLUDE
>>>> memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth
>>>> krbloginfailedcount'], u'nsds5replicaLastInitEnd': ['19700101000000Z'],
>>>> u'nsDS5ReplicatedAttributeListTotal': ['(objectclass=*) $ EXCLUDE
>>>> entryusn krblastsuccessfulauth krblastfailedauth
>>>> krbloginfailedcount']})]
>>>> 2019-03-28T09:30:55Z DEBUG Entry found
>>>> [LDAPEntry(ipapython.dn.DN('cn=meToidc02.my.dom.ain,cn=replica,cn=dc\=my\,dc\=dom\,dc\=ain,cn=mapping
>>>> tree,cn=config'), {u'nsds5replicaLastInitStart': ['19700101000000Z'],
>>>> u'nsds5replicaUpdateInProgress': ['FALSE'], u'cn':
>>>> ['meToidc02.my.dom.ain'], u'objectClass': ['nsds5replicationagreement',
>>>> 'top'], u'nsds5replicaLastUpdateEnd': ['19700101000000Z'],
>>>> u'nsDS5ReplicaRoot': ['dc=my,dc=dom,dc=ain'], u'nsDS5ReplicaHost':
>>>> ['idc02.my.dom.ain'], u'nsds5replicaLastUpdateStatus': ['Error (0) No
>>>> replication sessions started since server startup'],
>>>> u'nsDS5ReplicaBindMethod': ['SASL/GSSAPI'], u'nsds5ReplicaStripAttrs':
>>>> ['modifiersName modifyTimestamp internalModifiersName
>>>> internalModifyTimestamp'], u'nsds5replicaLastUpdateStart':
>>>> ['19700101000000Z'], u'nsDS5ReplicaPort': ['389'],
>>>> u'nsDS5ReplicaTransportInfo': ['LDAP'], u'description': ['me to
>>>> idc02.my.dom.ain'], u'nsds5replicareapactive': ['0'],
>>>> u'nsds5replicaChangesSentSinceStartup': [''], u'nsds5replicaTimeout':
>>>> ['120'], u'nsDS5ReplicatedAttributeList': ['(objectclass=*) $ EXCLUDE
>>>> memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth
>>>> krbloginfailedcount'], u'nsds5replicaLastInitEnd': ['19700101000000Z'],
>>>> u'nsDS5ReplicatedAttributeListTotal': ['(objectclass=*) $ EXCLUDE
>>>> entryusn krblastsuccessfulauth krblastfailedauth
>>>> krbloginfailedcount']})]
>>>> 2019-03-28T09:31:15Z DEBUG Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 570, in start_creation
>>>>     run_step(full_msg, method)
>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 560, in run_step
>>>>     method()
>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 456, in __setup_replica
>>>>     cacert=self.ca_file
>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1817, in setup_promote_replication
>>>>     raise RuntimeError("Failed to start replication")
>>>> RuntimeError: Failed to start replication
>>>> [...]
>>>>
>>>> while in /var/log/dirsrv/slapd-MY-DOM-AIN/errors of idc02 I can find:
>>>>
>>>> [...]
>>>> [28/Mar/2019:10:30:56.602197981 +0100] - INFO - NSMMReplicationPlugin -
>>>> repl5_tot_run - Beginning total update of replica
>>>> "agmt="cn=meToidc03.my.dom.ain" (idc03:389)".
>>>> [28/Mar/2019:10:31:15.787867217 +0100] - ERR - NSMMReplicationPlugin -
>>>> repl5_tot_log_operation_failure - agmt="cn=meToidc03.my.dom.ain"
>>>> (idc03:389): Received error -1 (Can't contact LDAP server):  for total
>>>> update operation
>>>> [28/Mar/2019:10:31:15.789885458 +0100] - ERR - NSMMReplicationPlugin -
>>>> release_replica - agmt="cn=meToidc03.my.dom.ain" (idc03:389): Unable to
>>>> send endReplication extended operation (Can't contact LDAP server)
>>>> [28/Mar/2019:10:31:15.791374133 +0100] - ERR - NSMMReplicationPlugin -
>>>> repl5_tot_run - Total update failed for replica
>>>> "agmt="cn=meToidc03.my.dom.ain" (idc03:389)", error (-11)
>>>> [28/Mar/2019:10:31:15.823809612 +0100] - INFO - NSMMReplicationPlugin -
>>>> bind_and_check_pwp - agmt="cn=meToidc03.my.dom.ain" (idc03:389):
>>>> Replication bind with GSSAPI auth resumed
>>>> [28/Mar/2019:10:31:16.221049084 +0100] - WARN - NSMMReplicationPlugin -
>>>> repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote
>>>> replica has a different database generation ID than the local database.
>>>>   You may have to reinitialize the remote replica, or the local
>>>> replica.
>>>> [28/Mar/2019:10:31:19.234198978 +0100] - WARN - NSMMReplicationPlugin -
>>>> repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote
>>>> replica has a different database generation ID than the local database.
>>>>   You may have to reinitialize the remote replica, or the local
>>>> replica.
>>>> [28/Mar/2019:10:31:22.247206811 +0100] - WARN - NSMMReplicationPlugin -
>>>> repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote
>>>> replica has a different database generation ID than the local database.
>>>>   You may have to reinitialize the remote replica, or the local
>>>> replica.
>>>>
>>>> The last message keeps repeating until I uninstall the replica on idc03.
>>>>
>>>>
>>>> How can I restore a scenario with a redundant setup (more than one ipa
>>>> server)?
>>>>
>>>> Thanks in advance,
>>>> Giulio Casella
>>>>
>>>>
>>>>
>>>>
>>>> On 26/03/2019 11:08, Giulio Casella via FreeIPA-users wrote:
>>>>> Hi Flo,
>>>>>
>>>>> On 26/03/2019 09:45, Florence Blanc-Renaud via FreeIPA-users wrote:
>>>>>> On 3/20/19 9:32 AM, Giulio Casella via FreeIPA-users wrote:
>>>>>>> Hi everyone,
>>>>>>> I'm stuck with a broken replica. I had a setup with two IPA servers
>>>>>>> in replication (ipa-server-4.6.4 on CentOS 7.6), let's say "idc01"
>>>>>>> and "idc02".
>>>>>>>
>>>>>>> Due to heavy load, idc01 crashed many times and stopped working.
>>>>>>>
>>>>>>> So I tried to set up the replica again. At first I tried
>>>>>>> "ipa-replica-manage re-initialize", with no success.
>>>>>>>
>>>>>>> Now I'm trying to redo the replica setup from scratch: on idc02 I
>>>>>>> removed the segments (ipa topologysegment-del, for both the ca and
>>>>>>> domain suffixes), on idc01 I removed everything (ipa-server-install
>>>>>>> --uninstall), then I joined the domain (ipa-client-install), and
>>>>>>> everything is working so far.
>>>>>>>
>>>>>>> When doing "ipa-replica-install" on idc01 I get:
>>>>>>>
>>>>>>> [...]
>>>>>>>     [28/41]: setting up initial replication
>>>>>>> Starting replication, please wait until this has completed.
>>>>>>> Update in progress, 22 seconds elapsed
>>>>>>> [ldap://idc02.my.dom.ain:389] reports: Update failed! Status: [Error
>>>>>>> (-11) connection error: Unknown connection error (-11) - Total
>>>>>>> update
>>>>>>> aborted]
>>>>>>>
>>>>>>>
>>>>>>> And on idc02 (the working server), in
>>>>>>> /var/log/dirsrv/slapd-MY-DOM-AIN/errors I find lines stating:
>>>>>>>
>>>>>>> [20/Mar/2019:09:28:06.545187923 +0100] - INFO -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> repl5_tot_run - Beginning total update of replica
>>>>>>> "agmt="cn=meToidc01.my.dom.ain" (idc01:389)".
>>>>>>> [20/Mar/2019:09:28:26.528046160 +0100] - ERR -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> perform_operation - agmt="cn=meToidc01.my.dom.ain" (idc01:389):
>>>>>>> Failed
>>>>>>> to send extended operation: LDAP error -1 (Can't contact LDAP
>>>>>>> server)
>>>>>>> [20/Mar/2019:09:28:26.530763939 +0100] - ERR -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> repl5_tot_log_operation_failure - agmt="cn=meToidc01.my.dom.ain"
>>>>>>> (idc01:389): Received error -1 (Can't contact LDAP server):  for
>>>>>>> total
>>>>>>> update operation
>>>>>>> [20/Mar/2019:09:28:26.532678072 +0100] - ERR -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> release_replica - agmt="cn=meToidc01.my.dom.ain" (idc01:389):
>>>>>>> Unable to
>>>>>>> send endReplication extended operation (Can't contact LDAP server)
>>>>>>> [20/Mar/2019:09:28:26.534307539 +0100] - ERR -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> repl5_tot_run - Total update failed for replica
>>>>>>> "agmt="cn=meToidc01.my.dom.ain" (idc01:389)", error (-11)
>>>>>>> [20/Mar/2019:09:28:26.561763168 +0100] - INFO -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> bind_and_check_pwp - agmt="cn=meToidc01.my.dom.ain" (idc01:389):
>>>>>>> Replication bind with GSSAPI auth resumed
>>>>>>> [20/Mar/2019:09:28:26.582389258 +0100] - WARN -
>>>>>>> NSMMReplicationPlugin -
>>>>>>> repl5_inc_run - agmt="cn=meToidc01.my.dom.ain" (idc01:389): The
>>>>>>> remote
>>>>>>> replica has a different database generation ID than the local
>>>>>>> database.
>>>>>>>    You may have to reinitialize the remote replica, or the local
>>>>>>> replica.
>>>>>>>
>>>>>>>
>>>>>>> It seems that idc02 remembers something about the old replica.
>>>>>>>
>>>>>>> Any hint?
>>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In order to clean every reference to the old replica:
>>>>>> (on idc01)
>>>>>> $ ipa-server-install --uninstall -U
>>>>>> $ kdestroy -A
>>>>>>
>>>>>> (on idc02)
>>>>>> $ ipa-replica-manage del idc01.my.dom.ain --clean --force
>>>>>>
>>>>>> Then you should be able to reinstall idc01 as a replica.
>>>>> No luck, same result: it hangs in "[28/41]: setting up initial
>>>>> replication", after about 20 seconds.
>>>>> I also tried, on idc02, to clean all RUVs referring to idc01, with no
>>>>> luck.
>>>>> _______________________________________________
>>>>> FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
>>>>> To unsubscribe send an email to
>>>>> freeipa-users-le...@lists.fedorahosted.org
>>>>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>>>>> List Guidelines:
>>>>> https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>>> List Archives:
>>>>> https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org
>>>>>
>>>>>