On 09/06/2012 05:55 PM, Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Rob Crittenden wrote:
>>> Martin Kosek wrote:
>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>> Rob Crittenden wrote:
>>>>>> Martin Kosek wrote:
>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>> Martin Kosek wrote:
>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>> Deleting a replica can leave a replication update vector (RUV) on
>>>>>>>>>> the other servers. This can confuse things if the replica is
>>>>>>>>>> re-added, and it also causes the server to calculate changes
>>>>>>>>>> against a server that may no longer exist.
>>>>>>>>>>
>>>>>>>>>> 389-ds-base provides a new task that self-propagates itself to all
>>>>>>>>>> available replicas to clean this RUV data.
>>>>>>>>>>
>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>> clean things up.
>>>>>>>>>>
>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old RUV
>>>>>>>>>> data may be re-propagated around.
>>>>>>>>>>
>>>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>>>> servers we have a RUV for. Using this you can call clean_ruv with
>>>>>>>>>> the replication id of a server that no longer exists to try the
>>>>>>>>>> cleanallruv step again.
>>>>>>>>>>
>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>> replica id that does exist it can cause a loss of data. I believe
>>>>>>>>>> I've put in enough scary warnings about this.
>>>>>>>>>>
>>>>>>>>>> rob
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>> with the previous version.
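As background for readers of this thread: a 389-ds-base CLEANALLRUV task is started by adding a task entry under `cn=cleanallruv,cn=tasks,cn=config`. The sketch below only builds the `(dn, modlist)` pair that a python-ldap `conn.add_s()` call would take; the exact attribute names and container DN are my reading of the 389-ds task interface, not code from the patch under review, so treat them as assumptions.

```python
def build_cleanallruv_task(replica_id, suffix="dc=example,dc=com"):
    """Return (dn, modlist) suitable for a python-ldap conn.add_s() call
    that would start RUV cleaning for the given replica id.

    Sketch only: attribute names and the task container DN are assumed
    from the 389-ds-base cleanallruv task, not taken from the patch.
    """
    # RUV data read back from LDAP arrives as strings; coerce early so
    # the "%d" formatting below cannot blow up (see the bug later in
    # this thread: "%d format: a number is required, not str").
    rid = int(replica_id)
    cn = "clean %d" % rid
    dn = "cn=%s,cn=cleanallruv,cn=tasks,cn=config" % cn
    modlist = [
        ("objectclass", [b"extensibleObject"]),
        ("cn", [cn.encode("utf-8")]),
        ("replica-base-dn", [suffix.encode("utf-8")]),
        ("replica-id", [str(rid).encode("utf-8")]),
    ]
    return dn, modlist
```

The entry itself is tiny; the interesting behavior (propagating the clean-up to every reachable replica) all happens server-side once the task entry lands.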
>>>>>>>>>
>>>>>>>>> This is what I found during review:
>>>>>>>>>
>>>>>>>>> 1) The list_ruv and clean_ruv command help in the man page is quite
>>>>>>>>> lost. I think it would help if we, for example, had all the info for
>>>>>>>>> the commands indented. As it is, a user could easily overlook the
>>>>>>>>> new commands in the man page.
>>>>>>>>>
>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to make
>>>>>>>>> them consistent with the rest of the commands (re-initialize,
>>>>>>>>> force-sync).
>>>>>>>>>
>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in an
>>>>>>>>> unattended way (for better testing), i.e. respect the --force option
>>>>>>>>> as we already do for ipa-replica-manage del. This fix would aid test
>>>>>>>>> automation in the future.
>>>>>>>>>
>>>>>>>>> 4) (minor) The new question (and the del one too) does not react too
>>>>>>>>> well to CTRL+D:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>
>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>> consistency. Be very careful.
>>>>>>>>> Continue to clean?
>>>>>>>>> [no]: unexpected error:
>>>>>>>>>
>>>>>>>>> 5) The help for the clean_ruv command without a required parameter
>>>>>>>>> is quite confusing, as it reports that the command is wrong and not
>>>>>>>>> the parameter:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>
>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>> force-sync | disconnect | connect | del | re-initialize | list |
>>>>>>>>> list_ruv]
>>>>>>>>>
>>>>>>>>> It seems you just forgot to specify the error message in the command
>>>>>>>>> definition.
>>>>>>>>>
>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails with
>>>>>>>>> an unexpected error:
>>>>>>>>>
>>>>>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>
>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>> consistency. Be very careful.
>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>
>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>> cleanAllRUV_task: failed to connect to repl agreement connection
>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>> tree,cn=config), error 105
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>> cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.
>>>>>>>>> bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>> tree, cn=config) has not been cleaned. You will need to rerun
>>>>>>>>> the CLEANALLRUV task on this replica.
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>> cleanAllRUV_task: Task failed (1)
>>>>>>>>>
>>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>>> failed, possibly because of disconnected replicas, and that they
>>>>>>>>> could enable the replicas and try again.
>>>>>>>>>
>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>> +            # We can't make the server we're removing read-only but
>>>>>>>>> +            # this isn't a show-stopper
>>>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>>>> +            pass
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this addresses everything.
>>>>>>>>
>>>>>>>> rob
>>>>>>>
>>>>>>> Thanks, almost there!
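Points 3) and 4) above ask for the confirmation prompt to honor --force and to survive CTRL+D (which raises EOFError in Python and currently surfaces as "unexpected error"). A minimal sketch of the suggested behavior, assuming a hypothetical `confirm_clean` helper (the real code uses `ipautil.user_input`; the injectable `reader` parameter is mine, added for testability):

```python
def confirm_clean(force=False, reader=input):
    """Ask "Continue to clean?" unless --force was given.

    Sketch of the behavior requested in points 3) and 4): --force skips
    the prompt entirely (unattended runs, test automation), and CTRL+D
    is treated as the default answer "no" instead of raising. `reader`
    is a hypothetical injection point standing in for raw input().
    """
    if force:
        return True
    try:
        answer = reader("Continue to clean? [no]: ")
    except EOFError:
        # CTRL+D closed stdin: fall back to the default answer, "no"
        return False
    return answer.strip().lower() in ("y", "yes")
```

With this shape, `ipa-replica-manage clean_ruv 3 --force` never touches stdin at all, which is what makes unattended test runs possible.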
>>>>>>> I just found one more issue which needs to be fixed before we push:
>>>>>>>
>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>> Directory Manager password:
>>>>>>>
>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>>> removal
>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>> "Can't contact LDAP server"}
>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>
>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>> required, not str
>>>>>>>
>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>> "Can't contact LDAP server"}
>>>>>>>
>>>>>>> This is the traceback I retrieved:
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>     del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>   File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>     repl1.cleanallruv(replica_id)
>>>>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
>>>>>>>     root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>>>
>>>>>>> The problem here is that you don't convert replica_id to int in this
>>>>>>> part:
>>>>>>> +    replica_id = None
>>>>>>> +    if repl2:
>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>> +    else:
>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>> +        for (netloc, rid) in servers:
>>>>>>> +            if netloc.startswith(replica2):
>>>>>>> +                replica_id = rid
>>>>>>> +                break
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>
>>>>>> Updated patch using the new mechanism in 389-ds-base. This should more
>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
>>>>>> provide a way to delete RUV data afterwards too if necessary.
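The traceback above happens because `rid` parsed out of the RUV is a string, while the debug call formats it with "%d". The fix Martin asks for amounts to coercing the id to int as soon as it is read; a sketch (the helper name is mine, not from the patch):

```python
def normalize_replica_id(rid):
    """Coerce a replica id read from a RUV into an int.

    RUV values come back from LDAP as strings, but cleanallruv() logs
    with "%d", which raises TypeError for a str. Converting early, and
    failing loudly on garbage, avoids the "%d format: a number is
    required, not str" error seen in the transcript above.
    """
    try:
        return int(rid)
    except (TypeError, ValueError):
        raise ValueError("invalid replica id: %r" % (rid,))
```

After this, `replica_id = normalize_replica_id(rid)` in the loop shown above would make both branches hand cleanallruv() an int.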
>>>>>>
>>>>>> rob
>>>>>
>>>>> Rebased patch
>>>>>
>>>>> rob
>>>>>
>>>>
>>>> 0) As I wrote in the review for your patch 1041, the changelog entry
>>>> slipped elsewhere.
>>>>
>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>>> know why you have it there, but since it is generally a bad thing to do,
>>>> some comment on why it is needed would be useful.
>>>>
>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>>          repl1.delete_agreement(replica2)
>>>>          repl1.delete_referral(replica2)
>>>>
>>>> +    if type1 == replication.IPA_REPLICA:
>>>> +        if repl2:
>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>> +        else:
>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>>> +
>>>> +        try:
>>>> +            repl1.cleanallruv(ruv)
>>>> +        except KeyboardInterrupt:
>>>> +            pass
>>>> +
>>>>
>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>
>>> No, it is there because it is safe to break out of it. The task will
>>> continue to run. I added some verbiage.
>>>
>>>> 2) This is related to 1): when some remote replica is down,
>>>> "ipa-replica-manage del" may wait indefinitely, right?
>>>>
>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>> Deleting a master is irreversible.
>>>> To reconnect to the remote master you will need to prepare a new
>>>> replica file and re-install.
>>>> Continue to delete?
>>>> [no]: y
>>>> ipa: INFO: Setting agreement
>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>> tree,cn=config schedule to 2358-2359 0 to force synch
>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>> tree,cn=config
>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>>>> acquired successfully: Incremental update succeeded: start: 0: end: 0
>>>> Background task created to clean replication data
>>>>
>>>> ... after about a minute I hit CTRL+C
>>>>
>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>>> 'vm-055.idm.lab.bos.redhat.com'
>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>>> does not contain 'vm-055.idm.lab.bos.redhat.com.'
>>>> You may need to manually remove them from the tree
>>>>
>>>> I think it would be better to inform the user that some remote replica
>>>> is down, or at least that we are waiting for the task to complete.
>>>> Something like this:
>>>>
>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>> ...
>>>> Background task created to clean replication data
>>>> Replication data clean up may take a very long time if some replica is
>>>> unreachable
>>>> Hit CTRL+C to interrupt the wait
>>>> ^C Clean up wait interrupted
>>>> ....
>>>> [continue with del]
>>>
>>> Yup, did this in #1.
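The points 1) and 2) resolved above boil down to one pattern: the CLEANALLRUV task runs server-side, so interrupting the local wait with CTRL+C is harmless and should be advertised to the admin. A sketch of that wait loop's shape, with an injectable `wait` callable standing in for the real task-polling code (my stand-in, not the patch's actual function):

```python
def wait_for_cleanallruv(wait):
    """Wait for the CLEANALLRUV task, letting the admin interrupt safely.

    Sketch of the behavior agreed in points 1) and 2): the clean-up task
    keeps running on the server after CTRL+C, so swallowing
    KeyboardInterrupt here only abandons the local wait, and the comment
    explaining that is exactly the "verbiage" Rob added. Returns True if
    the wait completed, False if it was interrupted.
    """
    print("Replication data clean up may take a very long time "
          "if some replica is unreachable")
    print("Hit CTRL+C to interrupt the wait")
    try:
        wait()  # stand-in for polling the task entry until it finishes
    except KeyboardInterrupt:
        # Safe to break out: the task continues server-side; we only
        # stop waiting for it locally, then carry on with the deletion.
        print("Clean up wait interrupted")
        return False
    return True
```

Returning a flag instead of re-raising is what lets `ipa-replica-manage del` continue with agreement and DNS cleanup after the interrupt, as in the transcript above.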
>>>
>>>> 3) (minor) When there is a cleanruv task running and you run
>>>> "ipa-replica-manage del", there is an unexpected error message with a
>>>> duplicate task object in LDAP:
>>>>
>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>> removal
>>>> FAIL
>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>
>>>> There were issues removing a connection: This entry already exists <<<<<<<<<
>>>>
>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>>> does not contain 'vm-072.idm.lab.bos.redhat.com.'
>>>> You may need to manually remove them from the tree
>>>>
>>>> I think it should be enough to just catch "entry already exists" in the
>>>> cleanallruv function, and in such a case print a relevant error message
>>>> and bail out. Thus, self.conn.checkTask(dn, dowait=True) would not be
>>>> called either.
>>>
>>> Good catch, fixed.
>>>
>>>> 4) (minor) In the make_readonly function, there is a redundant "pass"
>>>> statement:
>>>>
>>>> +    def make_readonly(self):
>>>> +        """
>>>> +        Make the current replication agreement read-only.
>>>> +        """
>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>> +
>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>> +        try:
>>>> +            self.conn.modify_s(dn, mod)
>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>> +            # We can't make the server we're removing read-only but
>>>> +            # this isn't a show-stopper
>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>> +            pass <<<<<<<<<<<<<<<
>>>
>>> Yeah, this is one of my common mistakes.
>>> I put in a pass initially, then added logging in front of it and forgot
>>> to delete the pass. It's gone now.
>>>
>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>>> user_input would be helpful (at least for test automation):
>>>>
>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>> +        sys.exit("Aborted")
>>>
>>> Yup, added.
>>>
>>> rob
>>
>> Slightly revised patch. I still had a window open with one unsaved change.
>>
>> rob
>
> Apparently there were two unsaved changes, one of which was lost. This adds
> in the 'entry already exists' fix.
>
> rob
>
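The 'entry already exists' fix mentioned above is the one Martin proposed for point 3): if adding the CLEANALLRUV task entry raises DuplicateEntry, report that a clean-up is already running and bail out before `checkTask(dn, dowait=True)` can block. A sketch of that control flow, with a stand-in exception class and injectable callables replacing the real ipalib/ipaldap machinery (all three names are mine, for illustration):

```python
class DuplicateEntry(Exception):
    """Stand-in for ipalib.errors.DuplicateEntry ("This entry already exists")."""


def create_clean_task(add_entry, check_task):
    """Create the CLEANALLRUV task entry, bailing out if one already exists.

    Sketch of the fix discussed in 3): `add_entry` stands in for
    self.conn.addEntry(e) and `check_task` for the blocking
    self.conn.checkTask(dn, dowait=True). On DuplicateEntry we print a
    relevant message and never reach the wait.
    """
    try:
        add_entry()
    except DuplicateEntry:
        # A task entry with this cn is already in cn=tasks: a clean-up
        # for this replica is in progress, so don't add or wait again.
        return "A CLEANALLRUV task for this replica is already running"
    check_task()
    return "Background task created to clean replication data"
```

This turns the raw "There were issues removing a connection: This entry already exists" into an explanatory message, and `ipa-replica-manage del` can proceed with the rest of the removal.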
Just one last thing (otherwise the patch is OK) - I don't think this is what
we want :-)

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: y      <<<<<<
Aborted

Nor this exception (you are checking for the wrong exception):

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: unexpected error: This entry already exists

This is the exception:
Traceback (most recent call last):
  File "/sbin/ipa-replica-manage", line 651, in <module>
    main()
  File "/sbin/ipa-replica-manage", line 648, in main
    clean_ruv(realm, args[1], options)
  File "/sbin/ipa-replica-manage", line 373, in clean_ruv
    thisrepl.cleanallruv(ruv)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1136, in cleanallruv
    self.conn.addEntry(e)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in addEntry
    self.__handle_errors(e, arg_desc=arg_desc)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in __handle_errors
    raise errors.DuplicateEntry()
ipalib.errors.DuplicateEntry: This entry already exists

Martin

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel