On 09/06/2012 06:05 PM, Martin Kosek wrote:
> On 09/06/2012 05:55 PM, Rob Crittenden wrote:
>> Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Martin Kosek wrote:
>>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>>> Rob Crittenden wrote:
>>>>>>> Martin Kosek wrote:
>>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>>> Martin Kosek wrote:
>>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>>> Deleting a replica can leave a replication update vector (RUV)
>>>>>>>>>>> on the other servers. This can confuse things if the replica is
>>>>>>>>>>> re-added, and it also causes the server to calculate changes
>>>>>>>>>>> against a server that may no longer exist.
>>>>>>>>>>>
>>>>>>>>>>> 389-ds-base provides a new task that self-propagates itself to
>>>>>>>>>>> all available replicas to clean this RUV data.
>>>>>>>>>>>
>>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>>> clean things up.
>>>>>>>>>>>
>>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old
>>>>>>>>>>> RUV data may be re-propagated around.
>>>>>>>>>>>
>>>>>>>>>>> To make things easier in this case I've added two new commands
>>>>>>>>>>> to ipa-replica-manage. The first lists the replication ids of
>>>>>>>>>>> all the servers we have a RUV for. Using this you can call
>>>>>>>>>>> clean_ruv with the replication id of a server that no longer
>>>>>>>>>>> exists to try the cleanallruv step again.
>>>>>>>>>>>
>>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>>> replica id that does exist it can cause a loss of data. I
>>>>>>>>>>> believe I've put in enough scary warnings about this.
>>>>>>>>>>>
>>>>>>>>>>> rob
>>>>>>>>>>
>>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>>> with the previous version.
>>>>>>>>>>
>>>>>>>>>> This is what I found during review:
>>>>>>>>>>
>>>>>>>>>> 1) The list_ruv and clean_ruv command help is easy to get lost
>>>>>>>>>> in the man page. I think it would help if, for example, all the
>>>>>>>>>> per-command info were indented; as it is, a user could easily
>>>>>>>>>> overlook the new commands.
>>>>>>>>>>
>>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
>>>>>>>>>> make them consistent with the rest of the commands
>>>>>>>>>> (re-initialize, force-sync).
>>>>>>>>>>
>>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in
>>>>>>>>>> an unattended way (for better testing), i.e. respect the --force
>>>>>>>>>> option as we already do for ipa-replica-manage del. This fix
>>>>>>>>>> would aid test automation in the future.
>>>>>>>>>>
>>>>>>>>>> 4) (minor) The new question (and the del one too) does not react
>>>>>>>>>> well to CTRL+D:
>>>>>>>>>>
>>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>
>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>> Continue to clean? [no]: unexpected error:
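A minimal sketch of the CTRL+D handling item 4) asks for. It assumes
ipautil.user_input() reads from stdin via raw_input(), which raises
EOFError at end-of-file; the confirm_clean() wrapper is purely
illustrative, not the actual patch:

    import sys
    from ipapython import ipautil

    def confirm_clean():
        try:
            # CTRL+D sends EOF, which raw_input() turns into EOFError
            answer = ipautil.user_input("Continue to clean?", False)
        except EOFError:
            # Treat EOF like answering "no" instead of letting the
            # exception bubble up as "unexpected error:"
            sys.exit("\nAborted")
        if not answer:
            sys.exit("Aborted")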
>>>>>>>>>>
>>>>>>>>>> 5) The help for the clean_ruv command without its required
>>>>>>>>>> parameter is quite confusing, as it reports that the command is
>>>>>>>>>> wrong and not the parameter:
>>>>>>>>>>
>>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>>
>>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>>> force-sync | disconnect | connect | del | re-initialize | list |
>>>>>>>>>> list_ruv]
>>>>>>>>>>
>>>>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>>>>> command definition.
>>>>>>>>>>
>>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>>>> with an unexpected error:
>>>>>>>>>>
>>>>>>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>
>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>>
>>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: failed to connect to repl agreement connection
>>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>>>> cn=mapping tree,cn=config), error 105
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.bos.redhat.com,
>>>>>>>>>> cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>>>> cn=mapping tree,cn=config) has not been cleaned. You will need
>>>>>>>>>> to rerun the CLEANALLRUV task on this replica.
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: Task failed (1)
>>>>>>>>>>
>>>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>>>> failed, possibly because of disconnected replicas, and that they
>>>>>>>>>> could enable the replicas and try again.
>>>>>>>>>>
>>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>>> +            # We can't make the server we're removing read-only
>>>>>>>>>> +            # but this isn't a show-stopper
>>>>>>>>>> +            root_logger.debug("No permission to switch replica
>>>>>>>>>> to read-only, continuing anyway")
>>>>>>>>>> +            pass
>>>>>>>>>
>>>>>>>>> I think this addresses everything.
>>>>>>>>>
>>>>>>>>> rob
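A sketch of the friendlier failure reporting 6) asks for, assuming the
failure surfaces as ldap.OPERATIONS_ERROR from python-ldap when the task
cannot be started (thisrepl and ruv are the names used by the patch
under review):

    import sys
    import ldap

    try:
        thisrepl.cleanallruv(ruv)
    except ldap.OPERATIONS_ERROR:
        # A down or unreachable replica is the most likely cause
        sys.exit("Clean RUV task failed. One or more replicas may be "
                 "down or unreachable; bring them back up and try again.")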
>>>>>>>>
>>>>>>>> Thanks, almost there! I just found one more issue which needs to
>>>>>>>> be fixed before we push:
>>>>>>>>
>>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>>> Directory Manager password:
>>>>>>>>
>>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com,
>>>>>>>> forcing removal
>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>> "Can't contact LDAP server"}
>>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>>
>>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>>> required, not str
>>>>>>>>
>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>> "Can't contact LDAP server"}
>>>>>>>>
>>>>>>>> This is a traceback I retrieved:
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>>     del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>>   File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>>     repl1.cleanallruv(replica_id)
>>>>>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>>>>>>> line 1094, in cleanallruv
>>>>>>>>     root_logger.debug("Creating CLEANALLRUV task for replica id %d"
>>>>>>>> % replicaId)
>>>>>>>>
>>>>>>>> The problem here is that you don't convert replica_id to an int in
>>>>>>>> this part:
>>>>>>>> +    replica_id = None
>>>>>>>> +    if repl2:
>>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>>> +    else:
>>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>>> +        for (netloc, rid) in servers:
>>>>>>>> +            if netloc.startswith(replica2):
>>>>>>>> +                replica_id = rid
>>>>>>>> +                break
>>>>>>>>
>>>>>>>> Martin
>>>>>>>
>>>>>>> Updated patch using the new mechanism in 389-ds-base. This should
>>>>>>> more thoroughly clean out RUV data when a replica is being deleted,
>>>>>>> and provide for a way to delete RUV data afterwards too if
>>>>>>> necessary.
>>>>>>>
>>>>>>> rob
>>>>>>
>>>>>> Rebased patch
>>>>>>
>>>>>> rob
>>>>>
>>>>> 0) As I wrote in a review of your patch 1041, the changelog entry
>>>>> slipped elsewhere.
>>>>>
>>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>>>> know why you have it there, but since it is generally a bad thing to
>>>>> do, a comment on why it is needed would be useful.
>>>>>
>>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>>>          repl1.delete_agreement(replica2)
>>>>>          repl1.delete_referral(replica2)
>>>>>
>>>>> +    if type1 == replication.IPA_REPLICA:
>>>>> +        if repl2:
>>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>>> +        else:
>>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>>>> +
>>>>> +        try:
>>>>> +            repl1.cleanallruv(ruv)
>>>>> +        except KeyboardInterrupt:
>>>>> +            pass
>>>>> +
>>>>>
>>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>>
>>>> No, it is there because it is safe to break out of it. The task will
>>>> continue to run. I added some verbiage.
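For reference, the interrupt-tolerant wait Rob describes, with the kind
of comment Martin asked for. A sketch assuming cleanallruv() blocks
while polling a task entry it has already created under cn=config:

    try:
        repl1.cleanallruv(ruv)
    except KeyboardInterrupt:
        # Safe to swallow: the CLEANALLRUV task entry has already been
        # added and the task keeps running server-side. CTRL+C only
        # stops us from waiting for it to finish.
        print "Wait for task interrupted. It will continue to run in the background."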
>>>>>
>>>>> 2) This is related to 1): "ipa-replica-manage del" may wait
>>>>> indefinitely when some remote replica is down, right?
>>>>>
>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>> Deleting a master is irreversible.
>>>>> To reconnect to the remote master you will need to prepare a new
>>>>> replica file and re-install.
>>>>> Continue to delete? [no]: y
>>>>> ipa: INFO: Setting agreement
>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>> tree,cn=config schedule to 2358-2359 0 to force synch
>>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>> tree,cn=config
>>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>>>>> acquired successfully: Incremental update succeeded: start: 0: end: 0
>>>>> Background task created to clean replication data
>>>>>
>>>>> ... after about a minute I hit CTRL+C
>>>>>
>>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com'
>>>>> to 'vm-055.idm.lab.bos.redhat.com'
>>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS
>>>>> record does not contain 'vm-055.idm.lab.bos.redhat.com.'
>>>>> You may need to manually remove them from the tree
>>>>>
>>>>> I think it would be better to inform the user that some remote
>>>>> replica is down, or at least that we are waiting for the task to
>>>>> complete. Something like this:
>>>>>
>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>> ...
>>>>> Background task created to clean replication data
>>>>> Replication data clean up may take a very long time if some replica
>>>>> is unreachable
>>>>> Hit CTRL+C to interrupt the wait
>>>>> ^C Clean up wait interrupted
>>>>> ...
>>>>> [continue with del]
>>>>
>>>> Yup, did this in #1.
>>>>
>>>>> 3) (minor) When there is a cleanruv task running and you run
>>>>> "ipa-replica-manage del", there is an unexpected error message about
>>>>> a duplicate task object in LDAP:
>>>>>
>>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>>> removal
>>>>> FAIL
>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
>>>>> "Can't contact LDAP server"}
>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>
>>>>> There were issues removing a connection: This entry already exists <<<<<<<<<
>>>>>
>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
>>>>> "Can't contact LDAP server"}
>>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS
>>>>> record does not contain 'vm-072.idm.lab.bos.redhat.com.'
>>>>> You may need to manually remove them from the tree
>>>>>
>>>>> I think it should be enough to just catch "entry already exists" in
>>>>> the cleanallruv function, and in such a case print a relevant error
>>>>> message and bail out. Thus, self.conn.checkTask(dn, dowait=True)
>>>>> would not be called either.
>>>>
>>>> Good catch, fixed.
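A sketch of the "entry already exists" handling agreed on in 3),
assuming cleanallruv() creates the task with self.conn.addEntry(), which
ipaldap maps to ipalib.errors.DuplicateEntry (as the traceback further
down shows):

    from ipalib import errors

    try:
        self.conn.addEntry(e)  # e is the CLEANALLRUV task entry
    except errors.DuplicateEntry:
        # A task for this replica id is already running; report it and
        # bail out instead of waiting on a task we did not create.
        print "A CLEANALLRUV task for this replica id is already running."
        return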
>>>>>
>>>>> 4) (minor) In the make_readonly function, there is a redundant
>>>>> "pass" statement:
>>>>>
>>>>> +    def make_readonly(self):
>>>>> +        """
>>>>> +        Make the current replication agreement read-only.
>>>>> +        """
>>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>>> +
>>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>>> +        try:
>>>>> +            self.conn.modify_s(dn, mod)
>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>> +            # We can't make the server we're removing read-only but
>>>>> +            # this isn't a show-stopper
>>>>> +            root_logger.debug("No permission to switch replica to
>>>>> read-only, continuing anyway")
>>>>> +            pass <<<<<<<<<<<<<<<
>>>>
>>>> Yeah, this is one of my common mistakes. I put in a pass initially,
>>>> then add logging in front of it and forget to delete the pass. It's
>>>> gone now.
>>>>
>>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>>>> user_input would be helpful (at least for test automation):
>>>>>
>>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>>> +        sys.exit("Aborted")
>>>>
>>>> Yup, added.
>>>>
>>>> rob
>>>
>>> Slightly revised patch. I still had a window open with one unsaved
>>> change.
>>>
>>> rob
>>
>> Apparently there were two unsaved changes, one of which was lost. This
>> adds in the 'entry already exists' fix.
>>
>> rob
>
> Just one last thing (otherwise the patch is OK) - I don't think this is
> what we want :-)
>
> # ipa-replica-manage clean-ruv 8
> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>
> Cleaning the wrong replica ID will cause that server to no
> longer replicate so it may miss updates while the process
> is running. It would need to be re-initialized to maintain
> consistency. Be very careful.
> Continue to clean? [no]: y <<<<<<
> Aborted
>
> Nor this exception (you are checking for the wrong exception):
>
> # ipa-replica-manage clean-ruv 8
> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>
> Cleaning the wrong replica ID will cause that server to no
> longer replicate so it may miss updates while the process
> is running. It would need to be re-initialized to maintain
> consistency. Be very careful.
> Continue to clean? [no]:
> unexpected error: This entry already exists
>
> This is the exception:
>
> Traceback (most recent call last):
>   File "/sbin/ipa-replica-manage", line 651, in <module>
>     main()
>   File "/sbin/ipa-replica-manage", line 648, in main
>     clean_ruv(realm, args[1], options)
>   File "/sbin/ipa-replica-manage", line 373, in clean_ruv
>     thisrepl.cleanallruv(ruv)
>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> line 1136, in cleanallruv
>     self.conn.addEntry(e)
>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503,
> in addEntry
>     self.__handle_errors(e, arg_desc=arg_desc)
>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321,
> in __handle_errors
>     raise errors.DuplicateEntry()
> ipalib.errors.DuplicateEntry: This entry already exists
>
> Martin
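Putting 5) and the "y -> Aborted" bug above together, the confirmation
logic presumably wants to end up as something like this sketch (the
options.force name is assumed to follow the existing --force handling in
ipa-replica-manage):

    # --force skips the prompt entirely; an affirmative answer proceeds
    if not options.force:
        if not ipautil.user_input("Continue to clean?", False):
            sys.exit("Aborted")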
On another matter, I just noticed that CLEANALLRUV does not proceed if I
have a winsync replica defined (even though it is up):

# ipa-replica-manage list
dc.ad.test: winsync    <<<<<<<
vm-072.idm.lab.bos.redhat.com: master
vm-086.idm.lab.bos.redhat.com: master

[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Waiting for all the replicas to receive all the deleted replica updates...
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
will retry later.
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Not all replicas caught up, retrying in 10 seconds
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
will retry later.
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Not all replicas caught up, retrying in 20 seconds
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
will retry later.
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
Not all replicas caught up, retrying in 40 seconds

I don't think this is OK. Adding Rich to CC to follow up on this one.

Martin
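For anyone following along, a rough way to see which agreements the
CLEANALLRUV task will try to contact, and which of them are winsync
agreements, is to search the mapping tree in cn=config; 389-ds marks
winsync agreements with the nsDSWindowsReplicationAgreement objectclass.
Host and credentials below are placeholders:

    import ldap

    conn = ldap.initialize("ldap://vm-086.idm.lab.bos.redhat.com:389")
    conn.simple_bind_s("cn=Directory Manager", "password")  # placeholder
    results = conn.search_s("cn=mapping tree,cn=config", ldap.SCOPE_SUBTREE,
                            "(objectclass=nsds5ReplicationAgreement)",
                            ["objectClass", "nsDS5ReplicaHost"])
    for dn, entry in results:
        # Winsync agreements carry an extra objectclass on top of the
        # plain nsds5ReplicationAgreement one
        ocs = [oc.lower() for oc in entry.get("objectClass", [])]
        winsync = "nsdswindowsreplicationagreement" in ocs
        print dn, "(winsync)" if winsync else ""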