On 09/06/2012 10:40 AM, Mark Reynolds wrote:


On 09/06/2012 12:13 PM, Rich Megginson wrote:
On 09/06/2012 10:09 AM, Martin Kosek wrote:
On 09/06/2012 06:09 PM, Martin Kosek wrote:
On 09/06/2012 06:05 PM, Martin Kosek wrote:
On 09/06/2012 05:55 PM, Rob Crittenden wrote:
Rob Crittenden wrote:
Rob Crittenden wrote:
Martin Kosek wrote:
On 09/05/2012 08:06 PM, Rob Crittenden wrote:
Rob Crittenden wrote:
Martin Kosek wrote:
On 07/05/2012 08:39 PM, Rob Crittenden wrote:
Martin Kosek wrote:
On 07/03/2012 04:41 PM, Rob Crittenden wrote:
Deleting a replica can leave a replication vector (RUV) on the other servers.
This can confuse things if the replica is re-added, and it also causes the
server to calculate changes against a server that may no longer exist.

389-ds-base provides a new task that propagates itself to all available
replicas to clean this RUV data.

This patch will create this task at deletion time to hopefully clean things up.
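
(For reference: the task is driven by an LDAP entry added under
cn=cleanallruv,cn=tasks,cn=config. A rough sketch of creating one by hand with
plain python-ldap follows; the host, password, suffix and replica id are
made-up examples, not values from the patch:)

    import ldap
    import ldap.modlist

    # bind as Directory Manager on the replica where the task should start
    # (example host/credentials only)
    conn = ldap.initialize("ldap://vm-086.idm.lab.bos.redhat.com:389")
    conn.simple_bind_s("cn=Directory Manager", "Secret123")

    # one task entry per replica id to clean; the task then propagates itself
    # to all reachable replicas in the topology
    dn = "cn=clean 5,cn=cleanallruv,cn=tasks,cn=config"
    attrs = {
        "objectclass": ["top", "extensibleObject"],
        "cn": ["clean 5"],
        "replica-base-dn": ["dc=idm,dc=lab,dc=bos,dc=redhat,dc=com"],
        "replica-id": ["5"],
    }
    conn.add_s(dn, ldap.modlist.addModlist(attrs))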

It isn't perfect. If any replica is down or unavailable at the time the
cleanruv task fires, and then comes back up, the old RUV data may be
re-propagated around.

To make things easier in this case I've added two new commands to
ipa-replica-manage. The first lists the replication ids of all the servers we
have a RUV for. Using this you can call clean_ruv with the replication id of a
server that no longer exists to try the cleanallruv step again.

This is quite dangerous though. If you run cleanruv against a replica id that
does exist it can cause a loss of data. I believe I've put in enough scary
warnings about this.

rob

Good work there, this should make cleaning RUVs much easier than with the
previous version.

This is what I found during review:

1) The list_ruv and clean_ruv command help is rather lost in the man page. I
think it would help if we, for example, indented all the info for each
command; as it is, a user could easily overlook the new commands in the man
page.


2) I would rename the new commands to clean-ruv and list-ruv to make them
consistent with the rest of the commands (re-initialize, force-sync).


3) It would be nice to be able to run the clean_ruv command in an unattended
way (for better testing), i.e. respect the --force option as we already do for
ipa-replica-manage del. This fix would aid test automation in the future.


4) (minor) The new question (and the del one too) does not react well to
CTRL+D:

# ipa-replica-manage clean_ruv 3 --force
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: unexpected error:
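
(CTRL+D makes raw_input() raise EOFError, so the prompt call presumably just
needs to be wrapped; a rough sketch, not the actual ipautil code:)

    try:
        if not ipautil.user_input("Continue to clean?", False):
            sys.exit("Aborted")
    except EOFError:
        # CTRL+D at the prompt; treat it the same as answering "no"
        sys.exit("Aborted")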


5) Help for the clean_ruv command without its required parameter is quite
confusing, as it reports that the command is wrong rather than the parameter:

# ipa-replica-manage clean_ruv
Usage: ipa-replica-manage [options]

ipa-replica-manage: error: must provide a command [clean_ruv |
force-sync |
disconnect | connect | del | re-initialize | list | list_ruv]

It seems you just forgot to specify the error message in the command
definition.


6) When the remote replica is down, the clean_ruv command fails with an
unexpected error:

[root@vm-086 ~]# ipa-replica-manage clean_ruv 5
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: y
unexpected error: {'desc': 'Operations error'}


/var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
[04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: failed to connect to repl agreement connection (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config), error 105
[04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config) has not been cleaned. You will need to rerun the CLEANALLRUV task on this replica.
[04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: Task failed (1)

In this case I think we should inform the user that the command failed,
possibly because of disconnected replicas, and that they could bring the
replicas back up and try again.


7) (minor) "pass" is now redundant in replication.py:
+        except ldap.INSUFFICIENT_ACCESS:
+ # We can't make the server we're removing read-only
but
+            # this isn't a show-stopper
+ root_logger.debug("No permission to switch replica to
read-only,
continuing anyway")
+            pass

I think this addresses everything.

rob
Thanks, almost there! I just found one more issue which needs to be fixed
before we push:

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
Directory Manager password:

Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing removal
Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't contact LDAP server"}
Forcing removal on 'vm-086.idm.lab.bos.redhat.com'

There were issues removing a connection: %d format: a number is required, not str

Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't contact LDAP server"}

This is a traceback I retrieved:
Traceback (most recent call last):
  File "/sbin/ipa-replica-manage", line 425, in del_master
    del_link(realm, r, hostname, options.dirman_passwd, force=True)
  File "/sbin/ipa-replica-manage", line 271, in del_link
    repl1.cleanallruv(replica_id)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
    root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)


The problem here is that you don't convert replica_id to int in this
part:
+    replica_id = None
+    if repl2:
+        replica_id = repl2._get_replica_id(repl2.conn, None)
+    else:
+        servers = get_ruv(realm, replica1, dirman_passwd)
+        for (netloc, rid) in servers:
+            if netloc.startswith(replica2):
+                replica_id = rid
+                break
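
(In other words, the rid pulled out of the RUV is a string and needs a cast
before it reaches the %d in cleanallruv(); a sketch against the quoted hunk:)

    else:
        servers = get_ruv(realm, replica1, dirman_passwd)
        for (netloc, rid) in servers:
            if netloc.startswith(replica2):
                # rid parsed out of the RUV comes back as a string, and
                # cleanallruv() formats it with %d, so cast it here
                replica_id = int(rid)
                break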

Martin

Updated patch using the new mechanism in 389-ds-base. This should more thoroughly clean out RUV data when a replica is being deleted, and also provide a way to delete RUV data afterwards if necessary.

rob
Rebased patch

rob

0) As I wrote in the review for your patch 1041, the changelog entry slipped
elsewhere.

1) The following KeyboardInterrupt except clause looks suspicious. I know why
you have it there, but since it is generally a bad thing to do, a comment
explaining why it is needed would be useful.

@@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
        repl1.delete_agreement(replica2)
        repl1.delete_referral(replica2)

+    if type1 == replication.IPA_REPLICA:
+        if repl2:
+            ruv = repl2._get_replica_id(repl2.conn, None)
+        else:
+            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
+
+        try:
+            repl1.cleanallruv(ruv)
+        except KeyboardInterrupt:
+            pass
+

Maybe you just wanted to do some cleanup and then "raise" again?
No, it is there because it is safe to break out of it. The task will
continue to run. I added some verbiage.

2) This is related to 1): when some remote replica is down,
"ipa-replica-manage del" may wait indefinitely, right?

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
Deleting a master is irreversible.
To reconnect to the remote master you will need to prepare a new replica file and re-install.
Continue to delete? [no]: y
ipa: INFO: Setting agreement cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired successfully: Incremental update succeeded: start: 0: end: 0
Background task created to clean replication data

... after about a minute I hit CTRL+C

^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to 'vm-055.idm.lab.bos.redhat.com'
Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record does not contain 'vm-055.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree

I think it would be better to inform the user that some remote replica is down,
or at least that we are waiting for the task to complete. Something like this:

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
...
Background task created to clean replication data
Replication data cleanup may take a very long time if some replica is unreachable
Hit CTRL+C to interrupt the wait
^C Clean up wait interrupted
....
[continue with del]
Yup, did this in #1.

3) (minor) When there is a cleanruv task running and you run
"ipa-replica-manage del", there is an unexpected error message caused by a
duplicate task object in LDAP:

# ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing removal
FAIL
Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't contact LDAP server"}
Forcing removal on 'vm-086.idm.lab.bos.redhat.com'

There were issues removing a connection: This entry already exists    <<<<<<<<<

Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't contact LDAP server"}
Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record does not contain 'vm-072.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree


I think it should be enough to just catch "entry already exists" in the
cleanallruv function, print a relevant error message in that case, and bail
out. That way self.conn.checkTask(dn, dowait=True) would not be called either.
Good catch, fixed.


4) (minor): In the make_readonly function, there is a redundant "pass"
statement:

+    def make_readonly(self):
+        """
+        Make the current replication agreement read-only.
+        """
+        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
+                ('cn', 'plugins'), ('cn', 'config'))
+
+        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
+        try:
+            self.conn.modify_s(dn, mod)
+        except ldap.INSUFFICIENT_ACCESS:
+            # We can't make the server we're removing read-only but
+            # this isn't a show-stopper
+            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
+            pass    <<<<<<<<<<<<<<<
Yeah, this is one of my common mistakes. I put in a pass initially, then add logging in front of it and forget to delete the pass. It's gone now.


5) In clean_ruv, I think allowing a --force option to bypass the
user_input
would be helpful (at least for test automation):

+    if not ipautil.user_input("Continue to clean?", False):
+        sys.exit("Aborted")
Yup, added.
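
(Presumably just gating the prompt on the existing --force flag, something
like:)

    if not options.force and not ipautil.user_input("Continue to clean?", False):
        sys.exit("Aborted")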

rob
Slightly revised patch. I still had a window open with one unsaved change.

rob

Apparently there were two unsaved changes, one of which was lost. This adds in
the 'entry already exists' fix.

rob

Just one last thing (otherwise the patch is OK) - I don't think this is what we
want :-)

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: y<<<<<<
Aborted


Nor this exception (you are checking for the wrong exception):

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]:
unexpected error: This entry already exists

This is the exception:

Traceback (most recent call last):
  File "/sbin/ipa-replica-manage", line 651, in <module>
    main()
  File "/sbin/ipa-replica-manage", line 648, in main
    clean_ruv(realm, args[1], options)
  File "/sbin/ipa-replica-manage", line 373, in clean_ruv
    thisrepl.cleanallruv(ruv)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1136, in cleanallruv
    self.conn.addEntry(e)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in addEntry
    self.__handle_errors(e, arg_desc=arg_desc)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in __handle_errors
    raise errors.DuplicateEntry()
ipalib.errors.DuplicateEntry: This entry already exists
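
(ipaldap's __handle_errors re-raises the duplicate as
ipalib.errors.DuplicateEntry, as the traceback shows, so the check around
addEntry presumably needs to catch that rather than the raw ldap exception.
A rough sketch of what I would expect in cleanallruv(), not the actual patch:)

    try:
        self.conn.addEntry(e)
    except errors.DuplicateEntry:
        # a CLEANALLRUV task for this replica id already exists; report it
        # and bail out so checkTask() isn't run against the old task
        root_logger.info("CLEANALLRUV task for replica id %s already exists" % replicaId)
        return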

Martin


On another matter, I just noticed that CLEANRUV does not proceed if I have a
winsync replica defined (even though it is up):

# ipa-replica-manage list
dc.ad.test: winsync<<<<<<<
vm-072.idm.lab.bos.redhat.com: master
vm-086.idm.lab.bos.redhat.com: master

[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Waiting for all the replicas to receive all the deleted replica updates...
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas caught up, retrying in 10 seconds
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas caught up, retrying in 20 seconds
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas caught up, retrying in 40 seconds

I don't think this is OK. Adding Rich to CC to follow up on this one.

Martin

And now the actual CC.
Martin

Yeah - looks like CLEANALLRUV needs to ignore windows sync agreements
Yeah, sorry, not that familiar with winsync (didn't know it used the same repl agmts). I will have to do another fix...
winsync repl agreements have both
objectclass: nsds5replicationagreement
objectclass: nsDSWindowsReplicationAgreement
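
(So the task presumably needs to skip agreements that also carry the Windows
objectclass. Just to illustrate the distinction, a python-ldap search that
returns only the non-winsync agreements; host and credentials are
placeholders:)

    import ldap

    conn = ldap.initialize("ldap://vm-086.idm.lab.bos.redhat.com:389")
    conn.simple_bind_s("cn=Directory Manager", "Secret123")

    # ordinary replication agreements, excluding winsync ones, which also
    # carry the nsDSWindowsReplicationAgreement objectclass
    AGMT_FILTER = ("(&(objectclass=nsds5replicationagreement)"
                   "(!(objectclass=nsDSWindowsReplicationAgreement)))")

    for dn, entry in conn.search_s("cn=mapping tree,cn=config",
                                   ldap.SCOPE_SUBTREE, AGMT_FILTER):
        print dn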

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel

