On 09/05/2012 08:06 PM, Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Martin Kosek wrote:
>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>> Martin Kosek wrote:
>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>> Deleting a replica can leave a replication vector (RUV) on the
>>>>>> other servers. This can confuse things if the replica is
>>>>>> re-added, and it also causes the server to calculate changes
>>>>>> against a server that may no longer exist.
>>>>>>
>>>>>> 389-ds-base provides a new task that self-propagates itself to
>>>>>> all available replicas to clean this RUV data.
>>>>>>
>>>>>> This patch will create this task at deletion time to hopefully
>>>>>> clean things up.
>>>>>>
>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>> time the cleanruv task fires, and then comes back up, the old
>>>>>> RUV data may be re-propagated around.
>>>>>>
>>>>>> To make things easier in this case I've added two new commands
>>>>>> to ipa-replica-manage. The first lists the replication ids of
>>>>>> all the servers we have a RUV for. Using this you can call
>>>>>> clean_ruv with the replication id of a server that no longer
>>>>>> exists to try the cleanallruv step again.
>>>>>>
>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>> replica id that does exist it can cause a loss of data. I
>>>>>> believe I've put in enough scary warnings about this.
>>>>>>
>>>>>> rob
>>>>>>
>>>>>
>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>> with the previous version.
>>>>>
>>>>> This is what I found during review:
>>>>>
>>>>> 1) The list_ruv and clean_ruv command help in the man page is
>>>>> quite lost. I think it would help if we, for example, had all the
>>>>> command info indented; as it is, a user could easily overlook the
>>>>> new commands in the man page.
>>>>>
>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
>>>>> make them consistent with the rest of the commands
>>>>> (re-initialize, force-sync).
>>>>>
>>>>> 3) It would be nice to be able to run the clean_ruv command in an
>>>>> unattended way (for better testing), i.e. respect the --force
>>>>> option as we already do for ipa-replica-manage del. This fix
>>>>> would aid test automation in the future.
>>>>>
>>>>> 4) (minor) The new question (and the del one too) does not react
>>>>> too well to CTRL+D:
>>>>>
>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>> Clean the Replication Update Vector for
>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>
>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>> longer replicate so it may miss updates while the process
>>>>> is running. It would need to be re-initialized to maintain
>>>>> consistency. Be very careful.
>>>>> Continue to clean? [no]: unexpected error:
>>>>>
>>>>> 5) Help for the clean_ruv command without a required parameter is
>>>>> quite confusing, as it reports that the command is wrong rather
>>>>> than the parameter:
>>>>>
>>>>> # ipa-replica-manage clean_ruv
>>>>> Usage: ipa-replica-manage [options]
>>>>>
>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>> force-sync | disconnect | connect | del | re-initialize | list |
>>>>> list_ruv]
>>>>>
>>>>> It seems you just forgot to specify the error message in the
>>>>> command definition.
>>>>>
>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>> with an unexpected error:
>>>>>
>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>> Clean the Replication Update Vector for
>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>
>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>> longer replicate so it may miss updates while the process
>>>>> is running. It would need to be re-initialized to maintain
>>>>> consistency. Be very careful.
>>>>> Continue to clean?
>>>>> [no]: y
>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>
>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>> cleanAllRUV_task: failed to connect to repl agreement connection
>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>> tree,cn=config), error 105
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>> cleanAllRUV_task: replica
>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>> tree,cn=config) has not been cleaned. You will need to rerun the
>>>>> CLEANALLRUV task on this replica.
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>> cleanAllRUV_task: Task failed (1)
>>>>>
>>>>> In this case I think we should inform the user that the command
>>>>> failed, possibly because of disconnected replicas, and that they
>>>>> could enable the replicas and try again.
>>>>>
>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>> +            # We can't make the server we're removing read-only but
>>>>> +            # this isn't a show-stopper
>>>>> +            root_logger.debug("No permission to switch replica to
>>>>> read-only, continuing anyway")
>>>>> +            pass
>>>>
>>>> I think this addresses everything.
>>>>
>>>> rob
>>>
>>> Thanks, almost there!
>>> I just found one more issue which needs to be fixed before we push:
>>>
>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>> Directory Manager password:
>>>
>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>> removal
>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>> "Can't contact LDAP server"}
>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>
>>> There were issues removing a connection: %d format: a number is
>>> required, not str
>>>
>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>> "Can't contact LDAP server"}
>>>
>>> This is a traceback I retrieved:
>>> Traceback (most recent call last):
>>>   File "/sbin/ipa-replica-manage", line 425, in del_master
>>>     del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>   File "/sbin/ipa-replica-manage", line 271, in del_link
>>>     repl1.cleanallruv(replica_id)
>>>   File
>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>> line 1094, in cleanallruv
>>>     root_logger.debug("Creating CLEANALLRUV task for replica id %d"
>>> % replicaId)
>>>
>>> The problem here is that you don't convert replica_id to int in
>>> this part:
>>> +    replica_id = None
>>> +    if repl2:
>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>> +    else:
>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>> +        for (netloc, rid) in servers:
>>> +            if netloc.startswith(replica2):
>>> +                replica_id = rid
>>> +                break
>>>
>>> Martin
>>>
>>
>> Updated patch using the new mechanism in 389-ds-base. This should
>> more thoroughly clean out RUV data when a replica is being deleted,
>> and provide a way to delete RUV data afterwards too if necessary.
>>
>> rob
>
> Rebased patch
>
> rob
>
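To make the int-conversion bug above concrete, here is a small
self-contained sketch. `parse_replica_id` is a hypothetical helper (not
actual IPA code) mirroring the quoted hunk: the rid parsed from the RUV
data is a string, and formatting it with "%d" raises exactly the
"a number is required, not str" TypeError seen in the traceback, so it
must be converted with int() before being passed on to cleanallruv().

```python
def parse_replica_id(servers, host):
    """Return the replication id for `host` from (netloc, rid) pairs,
    as parsed from list_ruv-style output.  rid arrives as a string and
    must be converted to int before cleanallruv() formats it with "%d".
    Hypothetical helper illustrating the fix, not the actual IPA code.
    """
    for (netloc, rid) in servers:
        if netloc.startswith(host):
            # Without int() here, a later "%d" % rid raises:
            # TypeError: %d format: a number is required, not str
            return int(rid)
    return None

rid = parse_replica_id([("vm-055.idm.lab.bos.redhat.com:389", "5")],
                       "vm-055")
print("Creating CLEANALLRUV task for replica id %d" % rid)
```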
0) As I wrote in a review for your patch 1041, the changelog entry
slipped elsewhere.

1) The following KeyboardInterrupt except clause looks suspicious. I
know why you have it there, but since it is generally a bad thing to
do, some comment explaining why it is needed would be useful.

@@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
         repl1.delete_agreement(replica2)
         repl1.delete_referral(replica2)

+    if type1 == replication.IPA_REPLICA:
+        if repl2:
+            ruv = repl2._get_replica_id(repl2.conn, None)
+        else:
+            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
+
+        try:
+            repl1.cleanallruv(ruv)
+        except KeyboardInterrupt:
+            pass
+

Maybe you just wanted to do some cleanup and then "raise" again?

2) This is related to 1): when some remote replica is down,
"ipa-replica-manage del" may wait indefinitely, right?

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
Deleting a master is irreversible.
To reconnect to the remote master you will need to prepare a new
replica file and re-install.
Continue to delete? [no]: y
ipa: INFO: Setting agreement
cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement
cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
tree,cn=config
ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
acquired successfully: Incremental update succeeded: start: 0: end: 0
Background task created to clean replication data

... after about a minute I hit CTRL+C

^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com'
to 'vm-055.idm.lab.bos.redhat.com'
Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
does not contain 'vm-055.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree

I think it would be better to inform the user that some remote replica
is down, or at least that we are waiting for the task to complete.
Something like this:

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
...
Background task created to clean replication data
Replication data clean up may take very long time if some replica is
unreachable
Hit CTRL+C to interrupt the wait
^C
Clean up wait interrupted
...
[continue with del]

3) (minor) When there is a cleanruv task already running and you run
"ipa-replica-manage del", there is an unexpected error message caused
by a duplicate task object in LDAP:

# ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
removal
FAIL
Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
"Can't contact LDAP server"}
Forcing removal on 'vm-086.idm.lab.bos.redhat.com'

There were issues removing a connection: This entry already exists <<<<<<<<<

Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
"Can't contact LDAP server"}
Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
does not contain 'vm-072.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree

I think it should be enough to just catch "entry already exists" in
the cleanallruv function, and in such a case print a relevant error
message and bail out. Thus, self.conn.checkTask(dn, dowait=True) would
not be called either.

4) (minor) In the make_readonly function, there is a redundant "pass"
statement:

+    def make_readonly(self):
+        """
+        Make the current replication agreement read-only.
+        """
+        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
+                ('cn', 'plugins'), ('cn', 'config'))
+
+        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
+        try:
+            self.conn.modify_s(dn, mod)
+        except ldap.INSUFFICIENT_ACCESS:
+            # We can't make the server we're removing read-only but
+            # this isn't a show-stopper
+            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
+            pass <<<<<<<<<<<<<<<

5) In clean_ruv, I think allowing a --force option to bypass the
user_input would be helpful (at least for test automation):

+    if not ipautil.user_input("Continue to clean?", False):
+        sys.exit("Aborted")

Martin

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel
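The interruptible wait proposed in point 2) could be sketched roughly
as follows. This is a hypothetical illustration, not the actual
ipa-replica-manage code: `run_task` stands in for whatever call blocks
on the CLEANALLRUV task in the directory server, and the messages are
the ones Martin suggests.

```python
def wait_for_cleanallruv(run_task):
    """Wait for a CLEANALLRUV task, letting the admin interrupt the
    wait with CTRL+C (e.g. when an unreachable replica would make the
    task block indefinitely) without aborting the rest of the delete.
    Sketch of the proposed UX; run_task is a stand-in callable.
    """
    print("Background task created to clean replication data")
    print("Replication data clean up may take very long time if some "
          "replica is unreachable")
    print("Hit CTRL+C to interrupt the wait")
    try:
        run_task()  # blocks until the task completes on all replicas
    except KeyboardInterrupt:
        # Swallowed deliberately: the task keeps running inside
        # 389-ds; we only stop waiting for it, then continue with
        # the rest of the deletion.
        print("Clean up wait interrupted")
```

The key difference from the quoted hunk is that the `except
KeyboardInterrupt` now carries a comment and a user-visible message,
so the silent `pass` no longer looks like an accident.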