On Thu, 2012-09-06 at 17:17 -0400, Rob Crittenden wrote:
> Martin Kosek wrote:
> > On 09/06/2012 05:55 PM, Rob Crittenden wrote:
> >> Rob Crittenden wrote:
> >>> Rob Crittenden wrote:
> >>>> Martin Kosek wrote:
> >>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
> >>>>>> Rob Crittenden wrote:
> >>>>>>> Martin Kosek wrote:
> >>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
> >>>>>>>>> Martin Kosek wrote:
> >>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
> >>>>>>>>>>> Deleting a replica can leave a replication update vector (RUV)
> >>>>>>>>>>> on the other servers. This can confuse things if the replica is
> >>>>>>>>>>> re-added, and it also causes the server to calculate changes
> >>>>>>>>>>> against a server that may no longer exist.
> >>>>>>>>>>>
> >>>>>>>>>>> 389-ds-base provides a new task that self-propagates itself to
> >>>>>>>>>>> all available replicas to clean this RUV data.
> >>>>>>>>>>>
> >>>>>>>>>>> This patch will create this task at deletion time to hopefully
> >>>>>>>>>>> clean things up.
> >>>>>>>>>>>
> >>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
> >>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old
> >>>>>>>>>>> RUV data may be re-propagated around.
> >>>>>>>>>>>
> >>>>>>>>>>> To make things easier in this case I've added two new commands
> >>>>>>>>>>> to ipa-replica-manage. The first lists the replication ids of
> >>>>>>>>>>> all the servers we have a RUV for. Using this you can call
> >>>>>>>>>>> clean_ruv with the replication id of a server that no longer
> >>>>>>>>>>> exists to try the cleanallruv step again.
> >>>>>>>>>>>
> >>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
> >>>>>>>>>>> replica id that does exist it can cause a loss of data. I
> >>>>>>>>>>> believe I've put in enough scary warnings about this.
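[Editor's note: the CLEANALLRUV task described above is created by adding an entry under cn=tasks,cn=config, which the directory server then propagates to every reachable replica. A minimal sketch of building such a task entry follows; the DN layout and attribute names follow the 389-ds-base CLEANALLRUV task as documented upstream, and the helper name is invented for illustration, not taken from the patch.]

```python
def build_cleanallruv_task(replica_id, suffix_dn):
    """Return a (dn, attrs) pair for a CLEANALLRUV task entry.

    Layout per the 389-ds-base CLEANALLRUV task documentation; this is an
    illustration, not the actual FreeIPA code under review.
    """
    # Coerce to int up front: formatting a string replica id with %d is
    # exactly the "%d format: a number is required, not str" failure seen
    # elsewhere in this thread.
    replica_id = int(replica_id)
    # The cn must be unique, otherwise the add fails with
    # "This entry already exists" (the duplicate-task issue in this thread).
    cn = 'clean %d' % replica_id
    dn = 'cn=%s,cn=cleanallruv,cn=tasks,cn=config' % cn
    attrs = {
        'objectclass': ['top', 'extensibleObject'],
        'cn': [cn],
        'replica-base-dn': [suffix_dn],
        'replica-id': [str(replica_id)],
    }
    return dn, attrs
```

The returned pair would then be handed to whatever LDAP add helper the installer uses (e.g. python-ldap's `add_s` after running the attrs through `ldap.modlist.addModlist`).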
> >>>>>>>>>>>
> >>>>>>>>>>> rob
> >>>>>>>>>>
> >>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
> >>>>>>>>>> with the previous version.
> >>>>>>>>>>
> >>>>>>>>>> This is what I found during review:
> >>>>>>>>>>
> >>>>>>>>>> 1) The list_ruv and clean_ruv command help is quite lost in the
> >>>>>>>>>> man page. I think it would help if we, for example, had all the
> >>>>>>>>>> info for commands indented. As it stands, a user could easily
> >>>>>>>>>> overlook the new commands in the man page.
> >>>>>>>>>>
> >>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
> >>>>>>>>>> make them consistent with the rest of the commands
> >>>>>>>>>> (re-initialize, force-sync).
> >>>>>>>>>>
> >>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in
> >>>>>>>>>> an unattended way (for better testing), i.e. respect the --force
> >>>>>>>>>> option as we already do for ipa-replica-manage del. This fix
> >>>>>>>>>> would aid test automation in the future.
> >>>>>>>>>>
> >>>>>>>>>> 4) (minor) The new question (and the del one too) does not react
> >>>>>>>>>> too well to CTRL+D:
> >>>>>>>>>>
> >>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
> >>>>>>>>>> Clean the Replication Update Vector for
> >>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
> >>>>>>>>>>
> >>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
> >>>>>>>>>> longer replicate so it may miss updates while the process
> >>>>>>>>>> is running. It would need to be re-initialized to maintain
> >>>>>>>>>> consistency. Be very careful.
> >>>>>>>>>> Continue to clean?
> >>>>>>>>>> [no]: unexpected error:
> >>>>>>>>>>
> >>>>>>>>>> 5) The help for the clean_ruv command without its required
> >>>>>>>>>> parameter is quite confusing, as it reports that the command is
> >>>>>>>>>> wrong rather than the parameter:
> >>>>>>>>>>
> >>>>>>>>>> # ipa-replica-manage clean_ruv
> >>>>>>>>>> Usage: ipa-replica-manage [options]
> >>>>>>>>>>
> >>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
> >>>>>>>>>> force-sync | disconnect | connect | del | re-initialize | list |
> >>>>>>>>>> list_ruv]
> >>>>>>>>>>
> >>>>>>>>>> It seems you just forgot to specify the error message in the
> >>>>>>>>>> command definition.
> >>>>>>>>>>
> >>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
> >>>>>>>>>> with an unexpected error:
> >>>>>>>>>>
> >>>>>>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
> >>>>>>>>>> Clean the Replication Update Vector for
> >>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
> >>>>>>>>>>
> >>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
> >>>>>>>>>> longer replicate so it may miss updates while the process
> >>>>>>>>>> is running. It would need to be re-initialized to maintain
> >>>>>>>>>> consistency. Be very careful.
> >>>>>>>>>> Continue to clean? [no]: y
> >>>>>>>>>> unexpected error: {'desc': 'Operations error'}
> >>>>>>>>>>
> >>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
> >>>>>>>>>> cleanAllRUV_task: failed to connect to repl agreement connection
> >>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
> >>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
> >>>>>>>>>> tree,cn=config), error 105
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
> >>>>>>>>>> cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.
> >>>>>>>>>> bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
> >>>>>>>>>> tree,cn=config) has not been cleaned. You will need to rerun
> >>>>>>>>>> the CLEANALLRUV task on this replica.
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
> >>>>>>>>>> cleanAllRUV_task: Task failed (1)
> >>>>>>>>>>
> >>>>>>>>>> In this case I think we should inform the user that the command
> >>>>>>>>>> failed, possibly because of disconnected replicas, and that they
> >>>>>>>>>> could enable the replicas and try again.
> >>>>>>>>>>
> >>>>>>>>>> 7) (minor) The "pass" is now redundant in replication.py:
> >>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
> >>>>>>>>>> +            # We can't make the server we're removing read-only
> >>>>>>>>>> +            # but this isn't a show-stopper
> >>>>>>>>>> +            root_logger.debug("No permission to switch replica
> >>>>>>>>>> to read-only, continuing anyway")
> >>>>>>>>>> +            pass
> >>>>>>>>>
> >>>>>>>>> I think this addresses everything.
> >>>>>>>>>
> >>>>>>>>> rob
> >>>>>>>>
> >>>>>>>> Thanks, almost there!
> >>>>>>>> I just found one more issue which needs to be fixed before we
> >>>>>>>> push:
> >>>>>>>>
> >>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
> >>>>>>>> Directory Manager password:
> >>>>>>>>
> >>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com,
> >>>>>>>> forcing removal
> >>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
> >>>>>>>> "Can't contact LDAP server"}
> >>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
> >>>>>>>>
> >>>>>>>> There were issues removing a connection: %d format: a number is
> >>>>>>>> required, not str
> >>>>>>>>
> >>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
> >>>>>>>> "Can't contact LDAP server"}
> >>>>>>>>
> >>>>>>>> This is the traceback I retrieved:
> >>>>>>>> Traceback (most recent call last):
> >>>>>>>>   File "/sbin/ipa-replica-manage", line 425, in del_master
> >>>>>>>>     del_link(realm, r, hostname, options.dirman_passwd,
> >>>>>>>> force=True)
> >>>>>>>>   File "/sbin/ipa-replica-manage", line 271, in del_link
> >>>>>>>>     repl1.cleanallruv(replica_id)
> >>>>>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> >>>>>>>> line 1094, in cleanallruv
> >>>>>>>>     root_logger.debug("Creating CLEANALLRUV task for replica id
> >>>>>>>> %d" % replicaId)
> >>>>>>>>
> >>>>>>>> The problem here is that you don't convert replica_id to int in
> >>>>>>>> this part:
> >>>>>>>> +    replica_id = None
> >>>>>>>> +    if repl2:
> >>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
> >>>>>>>> +    else:
> >>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
> >>>>>>>> +        for (netloc, rid) in servers:
> >>>>>>>> +            if netloc.startswith(replica2):
> >>>>>>>> +                replica_id = rid
> >>>>>>>> +                break
> >>>>>>>>
> >>>>>>>> Martin
> >>>>>>>
> >>>>>>> Updated patch using new mechanism in 389-ds-base.
> >>>>>>> This should more thoroughly clean out RUV data when a replica is
> >>>>>>> being deleted, and provide for a way to delete RUV data afterwards
> >>>>>>> too if necessary.
> >>>>>>>
> >>>>>>> rob
> >>>>>>
> >>>>>> Rebased patch
> >>>>>>
> >>>>>> rob
> >>>>>
> >>>>> 0) As I wrote in a review for your patch 1041, the changelog entry
> >>>>> slipped elsewhere.
> >>>>>
> >>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
> >>>>> know why you have it there, but since it is generally a bad thing to
> >>>>> do, some comment on why it is needed would be useful.
> >>>>>
> >>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2,
> >>>>> dirman_passwd, force=False):
> >>>>>          repl1.delete_agreement(replica2)
> >>>>>          repl1.delete_referral(replica2)
> >>>>>
> >>>>> +    if type1 == replication.IPA_REPLICA:
> >>>>> +        if repl2:
> >>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
> >>>>> +        else:
> >>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2,
> >>>>> dirman_passwd)
> >>>>> +
> >>>>> +        try:
> >>>>> +            repl1.cleanallruv(ruv)
> >>>>> +        except KeyboardInterrupt:
> >>>>> +            pass
> >>>>> +
> >>>>>
> >>>>> Maybe you just wanted to do some cleanup and then "raise" again?
> >>>>
> >>>> No, it is there because it is safe to break out of it. The task will
> >>>> continue to run. I added some verbiage.
> >>>>
> >>>>> 2) This is related to 1), but when some remote replica is down,
> >>>>> "ipa-replica-manage del" may wait indefinitely, right?
> >>>>>
> >>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> >>>>> Deleting a master is irreversible.
> >>>>> To reconnect to the remote master you will need to prepare a new
> >>>>> replica file and re-install.
> >>>>> Continue to delete?
> >>>>> [no]: y
> >>>>> ipa: INFO: Setting agreement
> >>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> >>>>> tree,cn=config schedule to 2358-2359 0 to force synch
> >>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
> >>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> >>>>> tree,cn=config
> >>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
> >>>>> acquired successfully: Incremental update succeeded: start: 0: end: 0
> >>>>> Background task created to clean replication data
> >>>>>
> >>>>> ... after about a minute I hit CTRL+C
> >>>>>
> >>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com'
> >>>>> to 'vm-055.idm.lab.bos.redhat.com'
> >>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS
> >>>>> record does not contain 'vm-055.idm.lab.bos.redhat.com.'
> >>>>> You may need to manually remove them from the tree
> >>>>>
> >>>>> I think it would be better to inform the user that some remote
> >>>>> replica is down, or at least that we are waiting for the task to
> >>>>> complete. Something like this:
> >>>>>
> >>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> >>>>> ...
> >>>>> Background task created to clean replication data
> >>>>> Replication data clean up may take a very long time if some replica
> >>>>> is unreachable
> >>>>> Hit CTRL+C to interrupt the wait
> >>>>> ^C Clean up wait interrupted
> >>>>> ...
> >>>>> [continue with del]
> >>>>
> >>>> Yup, did this in #1.
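[Editor's note: the behaviour agreed on here — warn that the wait may be long, and let CTRL+C abandon only the *wait* while the CLEANALLRUV task keeps running server-side (which is why swallowing KeyboardInterrupt is safe) — might be sketched as below. The function name, messages, and polling callback are illustrative, not the actual ipa-replica-manage code.]

```python
import time

def wait_for_cleanallruv(poll_task, timeout=None, interval=5):
    """Poll a CLEANALLRUV task until done, letting the admin bail out.

    poll_task() should return True once the task entry reports completion.
    CTRL+C only stops this wait: the task itself continues inside the
    directory server, so catching KeyboardInterrupt here is deliberate.
    Returns True if the task finished, False if the wait was abandoned.
    """
    print("Background task created to clean replication data")
    print("This may take a very long time if any replica is unreachable")
    print("Hit CTRL+C to interrupt the wait")
    deadline = time.time() + timeout if timeout else None
    try:
        while not poll_task():
            if deadline and time.time() > deadline:
                return False
            time.sleep(interval)
    except KeyboardInterrupt:
        # Safe to break out: the server-side task keeps running on its own.
        print("Clean up wait interrupted")
        return False
    return True
```

In the real tool the polling callback would re-read the task entry under cn=tasks,cn=config; here it is abstracted so the interrupt-handling shape stands on its own.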
> >>>>
> >>>>> 3) (minor) When there is a cleanruv task running and you run
> >>>>> "ipa-replica-manage del", there is an unexpected error message caused
> >>>>> by a duplicate task object in LDAP:
> >>>>>
> >>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
> >>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
> >>>>> removal
> >>>>> FAIL
> >>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
> >>>>> "Can't contact LDAP server"}
> >>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
> >>>>>
> >>>>> There were issues removing a connection: This entry already exists
> >>>>> <<<<<<<<<
> >>>>>
> >>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
> >>>>> "Can't contact LDAP server"}
> >>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS
> >>>>> record does not contain 'vm-072.idm.lab.bos.redhat.com.'
> >>>>> You may need to manually remove them from the tree
> >>>>>
> >>>>> I think it should be enough to just catch "entry already exists" in
> >>>>> the cleanallruv function, and in such a case print a relevant error
> >>>>> message and bail out. Thus, self.conn.checkTask(dn, dowait=True)
> >>>>> would not be called either.
> >>>>
> >>>> Good catch, fixed.
> >>>>
> >>>>> 4) (minor) In the make_readonly function, there is a redundant "pass"
> >>>>> statement:
> >>>>>
> >>>>> +    def make_readonly(self):
> >>>>> +        """
> >>>>> +        Make the current replication agreement read-only.
> >>>>> +        """
> >>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
> >>>>> +                ('cn', 'plugins'), ('cn', 'config'))
> >>>>> +
> >>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
> >>>>> +        try:
> >>>>> +            self.conn.modify_s(dn, mod)
> >>>>> +        except ldap.INSUFFICIENT_ACCESS:
> >>>>> +            # We can't make the server we're removing read-only but
> >>>>> +            # this isn't a show-stopper
> >>>>> +            root_logger.debug("No permission to switch replica to
> >>>>> read-only, continuing anyway")
> >>>>> +            pass <<<<<<<<<<<<<<<
> >>>>
> >>>> Yeah, this is one of my common mistakes. I put in a pass initially,
> >>>> then add logging in front of it and forget to delete the pass. It's
> >>>> gone now.
> >>>>
> >>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
> >>>>> user_input would be helpful (at least for test automation):
> >>>>>
> >>>>> +    if not ipautil.user_input("Continue to clean?", False):
> >>>>> +        sys.exit("Aborted")
> >>>>
> >>>> Yup, added.
> >>>>
> >>>> rob
> >>>
> >>> Slightly revised patch. I still had a window open with one unsaved
> >>> change.
> >>>
> >>> rob
> >>
> >> Apparently there were two unsaved changes, one of which was lost. This
> >> adds in the 'entry already exists' fix.
> >>
> >> rob
> >
> > Just one last thing (otherwise the patch is OK) - I don't think this is
> > what we want :-)
> >
> > # ipa-replica-manage clean-ruv 8
> > Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> >
> > Cleaning the wrong replica ID will cause that server to no
> > longer replicate so it may miss updates while the process
> > is running. It would need to be re-initialized to maintain
> > consistency. Be very careful.
> > Continue to clean?
> > [no]: y    <<<<<<
> > Aborted
> >
> > Nor this exception (you are checking for the wrong exception):
> >
> > # ipa-replica-manage clean-ruv 8
> > Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> >
> > Cleaning the wrong replica ID will cause that server to no
> > longer replicate so it may miss updates while the process
> > is running. It would need to be re-initialized to maintain
> > consistency. Be very careful.
> > Continue to clean? [no]:
> > unexpected error: This entry already exists
> >
> > This is the exception:
> >
> > Traceback (most recent call last):
> >   File "/sbin/ipa-replica-manage", line 651, in <module>
> >     main()
> >   File "/sbin/ipa-replica-manage", line 648, in main
> >     clean_ruv(realm, args[1], options)
> >   File "/sbin/ipa-replica-manage", line 373, in clean_ruv
> >     thisrepl.cleanallruv(ruv)
> >   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> > line 1136, in cleanallruv
> >     self.conn.addEntry(e)
> >   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503,
> > in addEntry
> >     self.__handle_errors(e, arg_desc=arg_desc)
> >   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321,
> > in __handle_errors
> >     raise errors.DuplicateEntry()
> > ipalib.errors.DuplicateEntry: This entry already exists
> >
> > Martin

Fixed that and a couple of other problems. When doing a disconnect we should
not also call clean-ruv.

I also got tired of seeing crappy error messages so I added a little convert
utility.

rob
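[Editor's note: the raw errors quoted in this thread, such as "unexpected error: {'desc': 'Operations error'}", are python-ldap error dicts printed verbatim. A "little convert utility" for such messages might look like the sketch below; the function name and exact behaviour are guesses at the shape of such a helper, not the actual FreeIPA code.]

```python
def convert_ldap_error(exc_or_dict):
    """Turn a raw python-ldap error into a one-line readable message.

    python-ldap reports errors as dicts like
    {'desc': "Can't contact LDAP server", 'info': '...'}; printed verbatim
    they produce the "unexpected error: {'desc': ...}" output seen above.
    Hypothetical helper, not the patch's actual convert utility.
    """
    if isinstance(exc_or_dict, Exception):
        # LDAPError instances carry the dict as their first argument.
        args = exc_or_dict.args[0] if exc_or_dict.args else {}
    else:
        args = exc_or_dict
    if isinstance(args, dict):
        parts = [args.get('desc', ''), args.get('info', '')]
        return ': '.join(p for p in parts if p) or str(args)
    return str(args)
```

With something like this in place, the output above would read "unexpected error: Can't contact LDAP server" instead of dumping the dict.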
ACK. Do you want to wait with the pushing until the winsync issue in
CLEANALLRUV is fixed? This is the ticket I created:
https://fedorahosted.org/389/ticket/450

Martin

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel