Re: [Openais] cpg behavior on transitional membership change
On Fri, Sep 02, 2011 at 10:30:53AM -0700, Steven Dake wrote:
> On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
> > Hi all,
> >
> > I'm trying to further investigate the problem I described at
> > https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
> >
> > The main problem for me there is that pacemaker first sees a
> > transitional membership with left nodes, then sees a stable membership
> > with those nodes returned, and does nothing about that. On the other
> > hand, dlm_controld sees CPG_REASON_NODEDOWN events on the CPGs related
> > to all its lockspaces (at the same time as the transitional membership
> > change) and stops the kernel part of each lockspace until the whole
> > cluster is rebooted (or until some other recovery procedure, which
> > unfortunately does not happen).
>
> I believe fenced should reboot the node, but only if there is quorum. It
> is possible your cluster lost quorum during this series of events. I
> have copied Dave for his feedback on this point.

I really can't make any sense of the report, sorry. Maybe reproduce it
without pacemaker, and then describe the specific steps to create the
issue and the resulting symptoms. After that we can determine what logs,
if any, would be useful.

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] Announcing Corosync 1.3.0
On Thu, Jan 13, 2011 at 08:09:13AM -0700, Steven Dake wrote:
> On 01/13/2011 08:03 AM, Lars Marowsky-Bree wrote:
> > On 2010-12-01T14:18:25, Steven Dake sd...@redhat.com wrote:
> > > Corosync 1.3.0 is available for immediate download from our website.
> > > This version brings many enhancements to the software. The two most
> > > visible enhancements are the UDPU transport mode and the
> > > cpg_model_initialize api call. The UDPU transport mode allows
> > > Corosync to run over basic UDP transport without the need for
> > > multicast support in the cluster switching environment. The API
> > > addition allows for correct operation of cluster file systems such
> > > as gfs2 and ocfs2.
> >
> > Hi Steven, can you elaborate please - how are current cluster file
> > system implementations not correct, or is this a mere API enhancement?
>
> Dave has more details, but we needed to add an API to give ring id
> information to fenced to prevent fenced from entering a stuck state. The
> conditions leading up to the problem are difficult for me to recall, but
> it did happen in community testing as well as internal validation. I
> recommend pinging dct offline if you need more information. (There is a
> fenced patch that goes with this, and perhaps something else).

Steve is explaining the addition of cpg_totem_confchg_fn:
http://www.corosync.org/git/?p=corosync.git;a=commitdiff;h=e8b143595cd3b3827c044164873c7825bc65b726
which I used here:
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=6dd65cf344d05730cbccb99ce5265e84f762bfde
after asking for it here:
https://lists.linux-foundation.org/pipermail/openais/2009-September/013022.html

I wouldn't explain it in terms of cluster file systems or fenced. It's
really addressing a shortcoming in the corosync APIs. The problem was that
it was impossible to correlate a callback from the cpg library with a
callback from another library for the same underlying event.

Dave
[Openais] [TOTEM ] A processor joined or left the membership and a new membership was formed.
I'm always looking for ways to make debugging/diagnosing corosync easier,
since it's notoriously difficult. I've always just ignored the messages in
the subject line; they seem more or less equivalent to "something
happened". (The length of corosync messages tends to be inversely
proportional to their usefulness.)

Is there some information we could put in those messages to make them
useful? I was recently looking at ring ids in my app, and found it would
be helpful to correlate them with what appeared in /var/log/messages, but
there are no ring ids there. Would this message or another be a sensible
place to put a ring id?

Dave
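For illustration, a minimal sketch of the suggestion above: include the totem ring id (representative nodeid plus sequence number) in the membership-change message so it can be correlated with the ring ids applications see. The message text and function name here are hypothetical, not corosync's actual code.

```c
#include <stdio.h>

/* Format a membership-change log line that carries the totem ring id
 * (rep_nodeid:ring_seq), so log entries can be matched against the
 * ring ids reported to cpg clients.  Hypothetical sketch. */
static void format_membership_msg(char *buf, size_t len,
                                  unsigned int rep_nodeid,
                                  unsigned long long ring_seq)
{
    snprintf(buf, len,
             "[TOTEM ] A new membership (ring %u:%llu) was formed.",
             rep_nodeid, ring_seq);
}
```

With a ring id like 1:2128 in the log line, a log reader can line the message up against the ring id an application received in its callbacks.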
Re: [Openais] [PATCH corosync] select a new sync member if the node with the lowest nodeid has left.
On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
> Problem: Under certain circumstances cpg does not send group leave
> messages. With a big token timeout (tested with token == 5min):
>
> 1. start all nodes
> 2. start ./test/testcpg on all nodes
> 3. go to the node with the lowest nodeid
> 4. ifconfig <int> down; killall -9 corosync;
>    /etc/init.d/corosync restart; ./testcpg
> 5. the other nodes will not get the cpg leave event
> 6. testcpg reports an extra cpg group (basically one was not removed)
>
> Solution: If a member gets removed (using the new trans_list) and that
> member is the node used for syncing (lowest nodeid), then the next
> lowest node needs to be chosen for syncing.
>
> David, would you mind confirming that this solves your problem?

It works great, thanks!
Dave

> -Angus
>
> Signed-off-by: Angus Salkeld asalk...@redhat.com
> ---
>  services/cpg.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 files changed, 36 insertions(+), 0 deletions(-)
>
> diff --git a/services/cpg.c b/services/cpg.c
> index ede426f..e9926ac 100644
> --- a/services/cpg.c
> +++ b/services/cpg.c
> @@ -414,6 +414,27 @@ struct req_exec_cpg_downlist {
>  
>  static struct req_exec_cpg_downlist g_req_exec_cpg_downlist;
>  
> +static int memb_list_remove_value (unsigned int *list,
> +	size_t list_entries, int value)
> +{
> +	int j;
> +	int found = 0;
> +
> +	for (j = 0; j < list_entries; j++) {
> +		if (list[j] == value) {
> +			/* mark next values to be copied down */
> +			found = 1;
> +		}
> +		else if (found) {
> +			list[j-1] = list[j];
> +		}
> +	}
> +	if (found)
> +		return (list_entries - 1);
> +	else
> +		return list_entries;
> +}
> +
>  static void cpg_sync_init_v2 (
>  	const unsigned int *trans_list,
>  	size_t trans_list_entries,
> @@ -432,6 +453,21 @@ static void cpg_sync_init_v2 (
>  		sizeof (unsigned int));
>  	my_member_list_entries = member_list_entries;
>  
> +	for (i = 0; i < my_old_member_list_entries; i++) {
> +		found = 0;
> +		for (j = 0; j < trans_list_entries; j++) {
> +			if (my_old_member_list[i] == trans_list[j]) {
> +				found = 1;
> +				break;
> +			}
> +		}
> +		if (found == 0) {
> +			my_member_list_entries = memb_list_remove_value (
> +				my_member_list,
> +				my_member_list_entries,
> +				my_old_member_list[i]);
> +		}
> +	}
> +
>  	for (i = 0; i < my_member_list_entries; i++) {
>  		if (my_member_list[i] < lowest_nodeid) {
>  			lowest_nodeid = my_member_list[i];
> -- 
> 1.6.6.1
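The helper at the heart of the patch can be exercised standalone. Below is a sketch of memb_list_remove_value() reconstructed from the diff above (with the comparison operators, stripped in the archive, restored): it removes a value from a member list by shifting later entries down, and returns the new entry count.

```c
#include <stddef.h>

/* Remove `value` from `list` in place by copying the following entries
 * down one slot; returns the resulting number of entries (unchanged if
 * the value was not present).  Mirrors the patch's helper. */
static size_t memb_list_remove_value(unsigned int *list,
                                     size_t list_entries,
                                     unsigned int value)
{
    size_t j;
    int found = 0;

    for (j = 0; j < list_entries; j++) {
        if (list[j] == value) {
            /* mark the following values to be copied down */
            found = 1;
        } else if (found) {
            list[j - 1] = list[j];
        }
    }
    return found ? list_entries - 1 : list_entries;
}
```

In cpg_sync_init_v2 this is applied to every old member that is absent from trans_list, so that nodes which died during the partition are dropped before the lowest-nodeid sync master is chosen.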
Re: [Openais] [PATCH corosync] select a new sync member if the node with the lowest nodeid has left.
On Thu, Apr 22, 2010 at 04:35:08PM -0500, David Teigland wrote:
> On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
> > Problem: Under certain circumstances cpg does not send group leave
> > messages. [...]
> > Solution: If a member gets removed (using the new trans_list) and
> > that member is the node used for syncing (lowest nodeid), then the
> > next lowest node needs to be chosen for syncing.
> >
> > David, would you mind confirming that this solves your problem?
>
> It works great, thanks!

That was after two tests, and it may have been a bit hasty... when I went
back to do some further tests, I happened to make a slight mistake running
the usual steps, and the node failure then went unnoticed like before.
When repeating the mistake intentionally, I get the same problem. This new
test is:

1.  nodes 1,2,3,4: cman_tool join
2.  create iptables partition: 1 | 2,3,4
3.  node 1: kill -9 corosync
4.  remove iptables partition: 1,2,3,4
5.  node 1: cman_tool join
6.  nodes 1,2,3,4: fenced; fence_tool join
7.  create iptables partition: 1 | 2,3,4
8.  node 1: kill -9 corosync
9.  remove iptables partition: 1,2,3,4
10. node 1: cman_tool join
11. no confchg removing 1 from the fenced cpg on nodes 2,3,4

Dave
[Openais] segfault in objdb
I'm using trunk svnversion 2770. I ran 'service cman start' on four nodes,
which I do all the time, and one segfaulted here:

Core was generated by `corosync -f'.
Program terminated with signal 11, Segmentation fault.
#0  0x7f1437774eb9 in object_find_next (
    object_find_handle=4760538031444721676, object_handle=0x7f1434031b78)
    at objdb.c:889
889         ((object_instance->object_name_len ==
Missing separate debuginfos, use: debuginfo-install corosync-1.2.0-1.fc12.x86_64
(gdb) bt
#0  0x7f1437774eb9 in object_find_next (
    object_find_handle=4760538031444721676, object_handle=0x7f1434031b78)
    at objdb.c:889
#1  0x7f143438c999 in message_handler_req_lib_confdb_object_find (
    conn=0xe75b90, message=0x7f142f9ff000) at confdb.c:697
#2  0x7f143813b3af in pthread_ipc_consumer (conn=0xe75b90) at coroipcs.c:701
#3  0x003065206a3a in start_thread () from /lib64/libpthread.so.0
#4  0x003064ade67d in clone () from /lib64/libc.so.6
#5  0x in ?? ()
Re: [Openais] corosync - CPG model_init + callback with totem ringid and members
On Thu, Apr 08, 2010 at 04:57:22PM +0200, Jan Friesse wrote:
> commit 0d509f4bf23f618c940c3bcdd7cf0e97faf64876
> Author: Jan Friesse jfrie...@redhat.com
> Date:   Thu Apr 8 16:48:45 2010 +0200
>
>     CPG model_initialize and ringid + members callback
>
>     Patch adds a new function to initialize cpg, cpg_model_initialize.
>     A model is a set of callbacks. With this function, future additions
>     of models should be possible without changing the ABI. Patch also
>     contains a callback in CPG_MODEL_V1 for notification about Totem
>     membership changes.

I've been doing extensive testing with this patch, and it's working well
(2010-04-08-cpg_model+totem_cb.patch); ack from me on going ahead with it.

Dave
Re: [Openais] stuck on sem_timedwait
On Wed, Apr 14, 2010 at 12:57:14PM +0200, Jan Friesse wrote:
> David,
> in that case, does corosync exit (so it is really not running) or not?

Yep, the corosync process is gone.

> David Teigland wrote:
> > When corosync exits, my application (fenced) gets stuck.
> > [...]
[Openais] stuck on sem_timedwait
When corosync exits, my application (fenced) gets stuck.

# strace -p 2005
Process 2005 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185487, 264}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185489, 0}, ) = -1 ETIMEDOUT (Connection timed out)
poll([{fd=14, events=0}], 1, 0)         = 1 ([{fd=14, revents=POLLNVAL}])
gettimeofday({1271185489, 198}, NULL)   = 0
futex(0x7f0a66f5b028, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1271185491, 0}, ) = -1 ETIMEDOUT (Connection timed out)

0x00338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cman-3.0.7-1.fc12.x86_64
(gdb) bt
#0  0x00338d00d417 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x003713e02311 in reply_receive (ipc_instance=0x2379ed0,
    res_msg=0x768c6a50, res_len=16) at coroipcc.c:476
#2  0x003713e02e7e in coroipcc_msg_send_reply_receive (
    handle=3265522690949120001, iov=0x768c6a80, iov_len=1,
    res_msg=0x768c6a50, res_len=16) at coroipcc.c:1045
#3  0x003713a01ed3 in cpg_finalize (handle=5902762718137417729) at cpg.c:238
#4  0x00403542 in close_cpg_daemon () at /root/stable3/fence/fenced/cpg.c:2311
#5  0x0040b26d in loop (argc=<value optimized out>,
    argv=<value optimized out>) at /root/stable3/fence/fenced/main.c:831
#6  main (argc=<value optimized out>, argv=<value optimized out>)
    at /root/stable3/fence/fenced/main.c:1045
Re: [Openais] corosync - CPG model_init + callback with totem ringid and members
On Fri, Apr 09, 2010 at 09:33:30AM +0200, Jan Friesse wrote:
> Dave,
>
> > Oh, and I may have just invented a time machine by merging partitioned
> > clusters!
> >
> > 1270661597 cluster node 1 added seq 2128
> > 1270661597 fenced:daemon conf 3 1 0 memb 1 2 4 join 1 left
> > 1270661597 cpg_mcast_joined retried 4 protocol
> > 1270661597 fenced:daemon ring 1:2128 3 memb 1 2 4
> > 1270661597 fenced:default conf 3 1 0 memb 1 2 4 join 1 left    (*)
> > 1270661597 add_change cg 5 joined nodeid 1
> > 1270661597 add_change cg 5 counts member 3 joined 1 remove 0 failed 0
> > 1270661597 check_ringid cluster 2128 cpg 2:2124
> > 1270661597 fenced:default ring 1:2128 3 memb 1 2 4    (**)
> > 1270661597 check_ringid done cluster 2128 cpg 1:2128
> > 1270661597 check_quorum done
> >
> > *  confchg callback adding node 1
> > ** totem callback adding node 1
>
> this is something little different and it is one of your requirements.

Yes, this ordering makes sense and works. I was just pointing out that
it's not *always* true that a totem callback precedes a confchg callback
when adding a node. Obviously Chrissie was thinking about a node starting
up and not the case of partition merging.

> ^^^ This is what you are talking about. Confchg precedes totem callback
> (as your requirements).

I never had a hard requirement about callback ordering, because I didn't
know exactly what effect it would have. But my suggestion was that when an
event caused both confchg and totem callbacks to be queued for a cpg, the
confchg_cb be queued first and the totem_cb be queued second. Now that
I've stepped through my test case a couple of times with this issue in
mind, I don't think I actually require any specific ordering of callbacks.
It looks like things will work the same regardless.

> Anyway, can you please send me (exactly) what problem (original problem)
> you are trying to solve?

My test case that hasn't worked (until now) is the following:

a. members 1,2,3
b. partition 1 / 2,3
c. merge 1,2,3
d. cluster is killed on node 1
e. cluster is started on node 1

In this case nodes 2 and 3 see:

a. cluster = 1,2,3
b. cluster -1 2228
c. cluster +1 2232
d. cluster -1 2236
e. cluster +1 2240

(cluster +/-N M is the cman callback adding/removing nodeid N with
ringid M)

Node 2 begins fencing node 1 in step b, but I've configured fencing to
fail indefinitely, so the fencing doesn't complete on 2 until step e, when
it sees node 1 restart cleanly (without its state). So *after* step e,
node 2 dispatches the following callbacks back to back:

u. conf +1
v. ring +1 2232
w. conf -1
x. ring -1 2236
y. ring +1 2240
z. conf +1

(conf +/-N is the confchg callback adding/removing nodeid N)
(ring +/-N M is the totem callback adding/removing nodeid N with ringid M)

Two problems which the new ring id resolves:

- When I saw w, I didn't know if this was a new failure that hadn't yet
  been reported via a cluster (cman) callback, or whether it was an old
  failure. In this case it corresponds to d, which I now know because the
  ringid in x is 2236 and the current cluster ringid is 2240. This is
  important because I need to know whether the current quorum value from
  cman is consistent with the state I've seen from cpg.

- I only want to process the latest confchg, because two matching
  confchg's, e.g. u and z, are otherwise impossible to uniquely reference
  between nodes. (I refer to both u and z as "confchg adding nodeid 1
  resulting in members 1,2,3".) If I process u, other nodes can sometimes
  not tell whether I'm referring to confchg u or z.

Both of these problems resulted in my app (fenced) getting one of those
two things wrong and becoming stuck.

Dave
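The staleness check described above can be sketched as follows. Names here are hypothetical (not fenced's actual code): the idea is simply that a confchg is acted on only once the ring id delivered by the cpg totem callback has caught up with the ring id from the cluster (cman) callback.

```c
#include <stdint.h>

/* Ring id as carried by the cpg totem callback: the representative
 * node plus a monotonically increasing sequence number. */
struct ring_id {
    uint32_t nodeid;
    uint64_t seq;
};

/* A confchg whose associated cpg ring id trails the current cluster
 * ring id belongs to an older membership event; quorum from cman is
 * only consistent with cpg state once the sequence numbers match. */
static int ringid_is_current(const struct ring_id *cpg_rid,
                             uint64_t cluster_seq)
{
    return cpg_rid->seq == cluster_seq;
}
```

In the u-z example above, callback x carries ringid 2236 while the cluster is already on 2240, so the removal in w is recognized as old (it corresponds to step d, not a new failure).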
Re: [Openais] corosync - CPG model_init + callback with totem ringid and members
On Thu, Apr 08, 2010 at 04:15:06PM +0100, Christine Caulfield wrote:
> On 08/04/10 15:57, Jan Friesse wrote:
> > Included is patch solving 2nd problem. In first problem, I agree with
> > Chrissie, and really don't have any single idea how to make regular
> > confchg precede totem_confchg.
>
> We can't. That is the order in which things happen. Short of
> implementing some form of time-machine in corosync it's not going to
> change :S

That makes sense. I need to go back to the drawing board on all this and
figure out whether the totem callback approach is going to solve the
problems I have.

Oh, and I may have just invented a time machine by merging partitioned
clusters!

1270661597 cluster node 1 added seq 2128
1270661597 fenced:daemon conf 3 1 0 memb 1 2 4 join 1 left
1270661597 cpg_mcast_joined retried 4 protocol
1270661597 fenced:daemon ring 1:2128 3 memb 1 2 4
1270661597 fenced:default conf 3 1 0 memb 1 2 4 join 1 left    (*)
1270661597 add_change cg 5 joined nodeid 1
1270661597 add_change cg 5 counts member 3 joined 1 remove 0 failed 0
1270661597 check_ringid cluster 2128 cpg 2:2124
1270661597 fenced:default ring 1:2128 3 memb 1 2 4    (**)
1270661597 check_ringid done cluster 2128 cpg 1:2128
1270661597 check_quorum done

*  confchg callback adding node 1
** totem callback adding node 1

Dave
Re: [Openais] corosync - CPG model_init + callback with totem ringid and members
On Thu, Apr 08, 2010 at 04:57:22PM +0200, Jan Friesse wrote:
> Included is patch solving 2nd problem.

Thanks, it works for me.

> In first problem, I agree with Chrissie, and really don't have any
> single idea how to make regular confchg precede totem_confchg.

I've stepped through things and it looks like the confchg/totem callbacks
will work fine as they are; I don't think I need any change because of
ordering. It just didn't match my initial expectations of what I'd see.

Dave
Re: [Openais] corosync - CPG model_init + callback with totem ringid and members
On Tue, Apr 06, 2010 at 02:05:00PM +0200, Jan Friesse wrote:
> Same patch but rebased on top of Steve's change (today trunk).

Thanks, this is mostly working well, but I've found one problem, and one
additional thing I need (mentioned on irc already):

1. When a node joins, I get the totem callback before the corresponding
   confchg callback. When a node leaves, I get them in the expected
   order: confchg followed by totem callback.

2. When my app starts up it needs to be able to get the current ring id,
   so we need to be able to get/force an initial totem callback after a
   cpg_join that indicates the current ring id.

I've also had a problem getting the current sequence number through
libcman/cman_get_cluster()/ci_generation ---

On node 2 I see:

in cman_dispatch statechange callback:
  call cman_get_cluster(), get generation 2124
  call cman_get_nodes(), see node 1 removed

in cman_dispatch statechange callback:
  call cman_get_cluster(), get generation 2128
  call cman_get_nodes(), see node 1 added

in cman_dispatch statechange callback:
  call cman_get_cluster(), get generation 2128 (expect 2132)
  call cman_get_nodes(), see node 1 removed

in cman_dispatch statechange callback:
  call cman_get_cluster(), get generation 2136
  call cman_get_nodes(), see node 1 added

The second time node 1 is removed, I get the previous generation (when
node 1 was added) instead of generation 2132, which the callback is for.
On node 4 I do get generation 2132 in that callback as expected. So it
seems like it could be a race; I've only gone through this test once.

Dave
Re: [Openais] corosync - CPG callback with totem ringid + members
On Tue, Mar 02, 2010 at 11:10:49AM +0100, Jan Friesse wrote:
> I'll give you an example. Let's say you have 3 nodes (A, B, C). B and C
> are joined in group EXAMPLE. Now A will fall... you will not get a
> normal confchg, because A was not in the group. Now on B, you will run a
> new cpg process joined to the group. If you call cpg_ringid_get, you
> will get a different result than before A fell. So the main question
> is, WHEN should the ringid change? From my point of view (because CPG
> is a lightweight membership), it should change when the group changes.
> But a group change doesn't mean a Totem membership change (both cases
> can really happen: group change without Totem membership change, and
> Totem membership change without group change). Of course, if we rely on
> group change, the totem ringid really doesn't make sense any longer. If
> we rely only on Totem membership change, we need something like I
> implemented in cpg_totem_confchg.

The existing totem ring id already has a defined behavior, and I wasn't
expecting anything beyond that; i.e. cpg_ringid_get would not tell us
anything new about cpg memb changes, only about totem changes. So, when a
totem memb change caused a cpg memb change, then it would be useful. But
it's not really necessary to have this with your other patch, so let's
just leave out the cpg_ringid_get.

> > I've started working on the code to use this, and it might be nice if
> > the parameters matched the normal confchg parameters as closely as
> > possible, i.e. include cpg_name, and use cpg_address instead of
> > uint32_t.
>
> I was thinking about that, but:
>
> - cpg_name of what? We are talking about a Totem membership change.
>   Totem doesn't know anything about groups. If you want the group_name
>   of a pid/nodeid/group unique triple, it can be implemented, but...
>   can you feel it doesn't fit very well?

OK, it's probably not necessary. Is there a totem confchg callback per
handle then? And do I still get all normal cpg confchg callbacks before
the totem_confchg callback?

>   The only thing from that structure which is used in a Totem
>   membership change is the nodeid (which is what we are returning
>   currently; true, member_list_entries is really not a good name and
>   should be something like node_list_entries).
>
> - What should be filled in pid? Totem doesn't know about client pids.

Ah, right, I'd not considered that. It's probably better to keep them
nodeids then.

> - Very similar is the reason field. I can imagine returning only
>   CPG_REASON_NODEDOWN and/or CPG_REASON_NODEUP.

I don't expect I'll need reason.

Thanks,
Dave
Re: [Openais] corosync - CPG callback with totem ringid + members
On Mon, Feb 22, 2010 at 06:00:21PM +0100, Jan Friesse wrote:
> +struct cpg_ring_id {
> +	uint32_t nodeid;
> +	uint64_t seq;
> +};

What do you think about combining this patch with the other patch that
adds cpg_ringid_get()? It's troublesome to combine the two patches to
test.

> +typedef void (*cpg_totem_confchg_fn_t) (
> +	cpg_handle_t handle,
> +	struct cpg_ring_id ring_id,
> +	uint32_t member_list_entries,
> +	const uint32_t *member_list);

I've started working on the code to use this, and it might be nice if the
parameters matched the normal confchg parameters as closely as possible,
i.e. include cpg_name, and use cpg_address instead of uint32_t.

Dave
Re: [Openais] corosync - CPG callback with totem ringid + members
On Mon, Feb 22, 2010 at 06:00:21PM +0100, Jan Friesse wrote:
> Related to https://bugzilla.redhat.com/show_bug.cgi?id=529424
>
> Patch implements a new callback with the current totem ring id and
> members. Included is a modified testcpg using the functionality. As
> required, the callback is delivered AFTER all normal confchg callbacks.
> Patch is not 100% tested (especially big endian issues and
> whitetank/older versions of corosync coexistence) but looks stable.
>
> David, does that functionality fulfill your requirements?

Thanks Honza! This looks like it will do well, but I can't be certain
until I work through the implementation and testing to use this on my
end. I'll try to get working on that soon so I can let you know.

Dave
Re: [Openais] does self-fencing makes sense?
On Fri, Feb 19, 2010 at 03:31:10PM -0700, Steven Dake wrote:
> There are millions of lines of C code involved in directing a power
> fencing device to fence a node. Generally in this case, the system
> directing the fencing is operating from a known good state.
>
> There are several hundred lines of C code that trigger a reboot when a
> watchdog timer isn't fed. Generally in this case, the system directing
> the fencing (itself) has entered an undefined failure state.
>
> So a quick matrix:
>
> model          LOC       operating environment
> power fencing  millions  well-defined
> self fencing   hundreds  undefined

I completely agree with you that less code is more trustworthy than more,
in general. But your thesis seems to be based entirely on the hundreds vs
millions difference, which I simply don't see. Anyone can configure a
watchdog to replace power fencing today; it's simple, and there will be a
negligible difference in the amount of code that's involved.

Dave
[Openais] [QUORUM] This node is within the primary component and will provide service.
The corosync logs are so full of these messages that they end up being
unhelpful. I think they could be made very helpful, though, if they were
printed only when the quorum state changed.

Dave

Index: exec/vsf_quorum.c
===================================================================
--- exec/vsf_quorum.c	(revision 2662)
+++ exec/vsf_quorum.c	(working copy)
@@ -135,11 +135,12 @@
 	size_t view_list_entries, int quorum,
 	struct memb_ring_id *ring_id)
 {
+	int old_quorum = primary_designated;
 	primary_designated = quorum;
 
-	if (primary_designated) {
+	if (primary_designated && !old_quorum) {
 		log_printf (LOGSYS_LEVEL_NOTICE,
 			"This node is within the primary component and will provide service.\n");
-	} else {
+	} else if (!primary_designated && old_quorum) {
 		log_printf (LOGSYS_LEVEL_NOTICE,
 			"This node is within the non-primary component and will NOT provide any services.\n");
 	}
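The pattern in the patch above, reduced to its essence: remember the previous state and log only on a transition. A log counter stands in for the log_printf calls so the behavior can be checked in isolation.

```c
/* Edge-triggered logging: repeated callbacks with the same quorum
 * state produce no output; only transitions do. */
static int primary_designated = 0;
static int log_count = 0;

static void quorum_state_change(int quorum)
{
    int old_quorum = primary_designated;

    primary_designated = quorum;
    if (primary_designated && !old_quorum)
        log_count++;            /* gained quorum: log once */
    else if (!primary_designated && old_quorum)
        log_count++;            /* lost quorum: log once */
}
```

With the original level-triggered code, every quorum callback emitted the message; with the edge-triggered version, a node that stays quorate through many membership events logs the line exactly once.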
Re: [Openais] corosync-objctl **binary**
On Wed, Jan 13, 2010 at 02:49:53PM +1100, Angus Salkeld wrote: On Wed, Jan 13, 2010 at 6:06 AM, David Teigland teigl...@redhat.com wrote: corosync-objctl used to print a lot of useful information which now appears only as **binary**. ?Is there a way to get that back? Perhaps two output modes, one where it prints binary values in hex and another where it makes a best effort to interpret and print the values in a useful form? Dave Hi David The keys are now typed, the default as used by the old API defaults to ANY (or void*). So if we have uses of the old API then these objects are printed out as **BINARY**. If they are in actual fact strings then we need to update the call to key_create() to use the new API, which alows us to pass in the type (in this case STRING). I wonder if there's anything preventing us from using the new API in the cluster.git code? Can you send me the output of objctl. I just want to see which objects are still not created correctly. cluster.name=**binary**(9) cluster.config_version=**binary**(9) cluster.totem.token=**binary**(9) cluster.clusternodes.clusternode.name=**binary**(9) cluster.clusternodes.clusternode.nodeid=**binary**(9) cluster.clusternodes.clusternode.fence.method.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.name=**binary**(9) cluster.clusternodes.clusternode.unfence.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.action=**binary**(9) cluster.clusternodes.clusternode.name=**binary**(9) cluster.clusternodes.clusternode.nodeid=**binary**(9) cluster.clusternodes.clusternode.fence.method.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.name=**binary**(9) 
cluster.clusternodes.clusternode.unfence.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.action=**binary**(9) cluster.clusternodes.clusternode.name=**binary**(9) cluster.clusternodes.clusternode.nodeid=**binary**(9) cluster.clusternodes.clusternode.fence.method.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.name=**binary**(9) cluster.clusternodes.clusternode.unfence.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.action=**binary**(9) cluster.clusternodes.clusternode.name=**binary**(9) cluster.clusternodes.clusternode.nodeid=**binary**(9) cluster.clusternodes.clusternode.fence.method.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.name=**binary**(9) cluster.clusternodes.clusternode.fence.method.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.name=**binary**(9) cluster.clusternodes.clusternode.unfence.device.port=**binary**(9) cluster.clusternodes.clusternode.unfence.device.action=**binary**(9) cluster.fencedevices.fencedevice.name=**binary**(9) cluster.fencedevices.fencedevice.agent=**binary**(9) cluster.fencedevices.fencedevice.ipaddr=**binary**(9) cluster.fencedevices.fencedevice.name=**binary**(9) cluster.fencedevices.fencedevice.agent=**binary**(9) cluster.fencedevices.fencedevice.ipaddr=**binary**(9) cluster.cman.nodename=**binary**(9) cluster.cman.cluster_id=**binary**(9) totem.token=**binary**(9) totem.version=**binary**(9) totem.nodeid=**binary**(9) totem.vsftype=**binary**(9) totem.token_retransmits_before_loss_const=**binary**(9) totem.join=**binary**(9) totem.consensus=**binary**(9) totem.rrp_mode=**binary**(9) totem.secauth=**binary**(9) totem.key=**binary**(9) totem.interface.ringnumber=**binary**(9) totem.interface.bindnetaddr=**binary**(9) totem.interface.mcastaddr=**binary**(9) 
totem.interface.mcastport=**binary**(9) libccs.next_handle=**binary**(9) libccs.connection.ccs_handle=**binary**(9) libccs.connection.config_version=**binary**(9) libccs.connection.fullxpath=**binary**(9) libccs.connection.ccs_handle=**binary**(9) libccs.connection.config_version=**binary**(9) libccs.connection.fullxpath=**binary**(9) libccs.connection.ccs_handle=**binary**(9) libccs.connection.config_version=**binary**(9) libccs.connection.fullxpath=**binary**(9) logging.timestamp=**binary**(9) logging.to_logfile=**binary**(9) logging.logfile=**binary**(9) logging.logfile_priority=**binary**(9) logging.to_syslog=**binary**(9) logging.syslog_facility=**binary**(9) logging.syslog_priority=**binary**(9) aisexec.user=**binary**(9) aisexec.group=**binary**(9) service.name=**binary**(9) service.ver=**binary**(9) service.name=**binary**(9) service.ver=**binary**(9) quorum.provider=**binary**(9) service.name=**binary**(9) service.ver=**binary**(9) service.name=**binary**(9) service.ver=**binary**(9)
[Openais] corosync-objctl **binary**
corosync-objctl used to print a lot of useful information which now appears only as **binary**. Is there a way to get that back? Perhaps two output modes, one where it prints binary values in hex and another where it makes a best effort to interpret and print the values in a useful form? Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
[Openais] [PATCH] corosync/trunk QUORUM log message
This puts multiple nodeids on each [QUORUM] Members line instead of putting each nodeid on a separate line. With more than a few nodes the excessive lines become a real nuisance, and anyone up around 32 nodes may literally be scrolling through hundreds of those lines.

Index: vsf_quorum.c
===================================================================
--- vsf_quorum.c	(revision 2562)
+++ vsf_quorum.c	(working copy)
@@ -103,7 +103,35 @@
 static size_t quorum_view_list_entries = 0;
 static int quorum_view_list[PROCESSOR_COUNT_MAX];
 struct quorum_services_api_ver1 *quorum_iface = NULL;
+static char view_buf[64];
+
+static void log_view_list(const unsigned int *view_list, size_t view_list_entries)
+{
+	int total = (int)view_list_entries;
+	int len, pos, ret, cnt;
+	int i = 0;
+
+	while (1) {
+		len = sizeof(view_buf);
+		pos = 0;
+		memset(view_buf, 0, len);
+		cnt = 0;
+
+		for (; i < total; i++) {
+			ret = snprintf(view_buf + pos, len - pos, " %d", view_list[i]);
+			if (ret >= len - pos)
+				break;
+			pos += ret;
+			cnt++;
+		}
+
+		log_printf (LOGSYS_LEVEL_NOTICE, "Members[%d]:%s%s",
+			total, view_buf, i < total ? " \\" : "");
+
+		if (i == total)
+			break;
+	}
+}
+
 /* Internal quorum API function */
 static void quorum_api_set_quorum(const unsigned int *view_list,
 	size_t view_list_entries,
@@ -123,9 +151,7 @@
 	memcpy(quorum_ring_id, ring_id, sizeof (quorum_ring_id));
 	memcpy(quorum_view_list, view_list, sizeof(unsigned int)*view_list_entries);
 
-	log_printf (LOGSYS_LEVEL_NOTICE, "Members[%d]: ", (int)view_list_entries);
-	for (i=0; i<view_list_entries; i++)
-		log_printf (LOGSYS_LEVEL_NOTICE, "%d ", view_list[i]);
+	log_view_list(view_list, view_list_entries);
 
 	/* Tell internal listeners */
 	send_internal_notification();

___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] cherrypicking into flatiron discussion - post 1.1.0
On Mon, Sep 21, 2009 at 08:35:33AM -0700, Steven Dake wrote: 4) flatiron to trail trunk with bug resolution It appears waiting months to cherrypick patches doesn't produce a high quality flatiron that people can use continuously. I'm open to suggestions. One option is to set some time limit on which a bug fix patch will remain in trunk before being merged into flatiron. Time open to debate... 7-10 days? This seems backward, since most testing and bug fixes will originate in flatiron, and the bug fixes are highly relevant to the flatiron that people are using and much less relevant to trunk where real use is low. We're much more concerned (and it's much more useful) that a bug fix soak in flatiron as opposed to trunk. I suggest all bug fixes go immediately into flatiron for testing/soaking. You cut this off a week+ prior to releasing from flatiron, of course. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] correlating events
On Thu, Sep 10, 2009 at 04:11:28PM -0700, Steven Dake wrote: IMO the proper way to do this is to ensure whatever ringid was delivered in a callback to the application is the current ring id returned by the api. This gets rid of any races you describe above. I can't really think of any races that would concern me. I described two different queries using one function; maybe it would be clearer if I described them using two separate functions. 1. cpg_ringid_confchg_cb(id1) id1 is the ringid associated with the last cpg confchg callback delivered to the app via cpg_dispatch(). If I call cpg_ringid_confchg_cb() from within a callback, I will be able to know the ringid associated with each confchg. Of course cpg confchgs (joins/leaves) can happen without a change in the ringid. And likewise, the ringid can change without any corresponding cpg confchg. Cman on the other hand is always in step with each ringid change. What I want my app to do is wait until it knows that cpg and cman are in sync with each other: 1. If cpg has more recent events than cman, then wait for cman to catch up. (the cpg_ringid_confchg_cb call above will solve this one) 2. If cman has more recent events than cpg, then wait for cpg to catch up. (still looking for a way to do this one) So the next function is trying to solve 2, and I figured using ringid's again might be good. What makes it tricky is that the most recent ringid returned by cman may not cause a cpg confchg. The last ringid returned by cpg_ringid_confchg_cb() may be less than the cman ringid, and waiting for them to match won't work. When the cman ringid is greater than the cpg ringid, the app doesn't know if it's because the cpg callbacks just haven't been delivered yet, or because there are no cpg callbacks for that ringid. Functions of various forms could tell us, though. One possibility: 2. 
rv = cpg_ringid_done(ringid) (I'd pass in the ringid from cman) rv would be 0 if there are any undelivered confchgs to the app for the ringid provided; rv would be 1 if all confchgs have been delivered to the app up to and including the ringid provided. Or, something like I mentioned in the previous mail where cpg returns the latest ringid it has seen for which all confchgs (if any) have been delivered to the app. Chrissie pointed out that libcman only returns the 64 bit ringid as uint32, but I doubt we'll see ringid's bigger than that; even if we do, I'm just comparing consecutive id's so the lower 32 bits should be fine. Once the ring id is greater than 32 bits, you would always be comparing 0. I don't follow. Looks like cman needs this error corrected, along with the addition of the ring leader node id. A ring id is uniquely identified by the nodeid of the ring leader and the 64 bit value of the ringid. Need both values in the comparison. I'm mainly interested in an equal comparison of ringids, but it might be convenient to know if one came after another. Would the ringid sequence number ever not increase and in what situations? Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] correlating events
On Mon, Aug 31, 2009 at 02:28:33PM -0700, Steven Dake wrote: On Mon, 2009-08-31 at 15:44 -0500, David Teigland wrote: Here are two related and troublesome problems that would be nice to fix, probably in future versions -- they probably can't be fixed maintaining existing apis and protocols (although adding new api's to help with them might be nice if possible). 1. correlating events from different services locally I get nodedown from both cman (or quorum service) and cpg. I need to correlate them with each other. When I get a cpg nodedown for node A, I don't know which cman nodedown for A it refers to: one of multiple in the past or one in the future that cman hasn't reported yet. Correlation could be solved by addition of api to cman, cpg, and quorum to retrieve the globally unique ring id for the last configuration change delivered to the application. If you agree, we can work on the implementation for corosync 1.1. Adding this to CPG is trivial, not sure about other services. Our policies wrt x.y.z would not be violated with this change. As an example, the API for cpg might look like cpg_ringid_get (handle, ring_id); Then ring_id could be memcmp'ed in the application. This would retrieve the last ring id delivered to the application (not the current ring id known to the cpg service). Turns out that libcman already has a call that returns the ring id, so all I need now is the addition to cpg. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
[Openais] correlating events
Here are two related and troublesome problems that would be nice to fix, probably in future versions -- they probably can't be fixed maintaining existing apis and protocols (although adding new api's to help with them might be nice if possible). 1. correlating events from different services locally I get nodedown from both cman (or quorum service) and cpg. I need to correlate them with each other. When I get a cpg nodedown for node A, I don't know which cman nodedown for A it refers to: one of multiple in the past or one in the future that cman hasn't reported yet. 2. correlating events among nodes Some kind of global event/generation id associated with each configuration that nodes can use to refer to the same event. (For extra credit, make this useful to detect the first configuration of a cpg across the cluster.) Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] correlating events
On Mon, Aug 31, 2009 at 02:28:33PM -0700, Steven Dake wrote: On Mon, 2009-08-31 at 15:44 -0500, David Teigland wrote: Here are two related and troublesome problems that would be nice to fix, probably in future versions -- they probably can't be fixed maintaining existing apis and protocols (although adding new api's to help with them might be nice if possible). 1. correlating events from different services locally I get nodedown from both cman (or quorum service) and cpg. I need to correlate them with each other. When I get a cpg nodedown for node A, I don't know which cman nodedown for A it refers to: one of multiple in the past or one in the future that cman hasn't reported yet. Correlation could be solved by addition of api to cman, cpg, and quorum to retrieve the globally unique ring id for the last configuration change delivered to the application. If you agree, we can work on the implementation for corosync 1.1. Adding this to CPG is trivial, not sure about other services. Our policies wrt x.y.z would not be violated with this change. As an example, the API for cpg might look like cpg_ringid_get (handle, ring_id); Then ring_id could be memcmp'ed in the application. This would retrieve the last ring id delivered to the application (not the current ring id known to the cpg service). Nice, I think that should work well for what I need. I'd probably call ringid_get() within the callback itself to get the ringid for it. 2. correlating events among nodes Some kind of global event/generation id associated with each configuration that nodes can use to refer to the same event. (For extra credit, make this useful to detect the first configuration of a cpg across the cluster.) define event here, you mean configuration change event or message delivery event? I'm thinking of confchg's. The problem I'm currently working around is when the app gets a series of cpg confchg's and then wants to communicate and refer to one of the confchg's in particular. 
Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync] - Allow only one connection per (node, pid, grp)
On Mon, Jul 20, 2009 at 10:03:36AM +0200, Jan Friesse wrote: The patch solves the problem of one process connecting multiple times to one group, by disallowing this situation. Please see the patch comment for more information. David, do you agree that this is how cpg should behave, or would you rather see support for multiple (node, pid, grp)? (For me, it really doesn't make any sense.) Returning an error seems pretty obvious, I can't imagine anyone thinks it makes sense. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync] [patch] - Fix problems with long token timeout and cpg
On Thu, Jul 02, 2009 at 11:09:26AM -0700, Steven Dake wrote: On Thu, 2009-07-02 at 09:27 -0500, David Teigland wrote: On Thu, Jul 02, 2009 at 01:15:18PM +0200, Jan Friesse wrote: David Teigland wrote: On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote: other nodes should immediately recognize it has previously failed and process a complete failure for it. i.e. the full equivalent to what apps (using any api's) would see if the node had failed via normal token timeout. More or less agree, but does this patch fix the problem for you or not? I haven't tried the patch, but based on the description and a quick look at the patch, I don't think it helps. Think more broadly about what's happening here, don't focus on one particular effect. 1. nodes 1,2,3,4: are cluster members 2. nodes 1,2,3,4: are using services A,B,C,D 3. node4: ifdown eth0, kill corosync 4. node4: ifup eth0, start corosync 5. node4: do not start/use any services 6. nodes 1,2,3: never see node4 removed from membership 7. nodes 1,2,3: services A,B,C,D never see node4 removed/fail Individual services have to protect against those sorts of restarts. The only other mechanism would be to break wire compatibility within Totem. I'm trying to define my specific problem for you; how/when/where you actually fix it isn't my main concern at this point. (I'd suggest starting with a real, proper fix, without regard to compatibility restrictions. We'll get that working well. Then, investigate the options for backporting the same behavior into stable versions. Doing that without breaking compat will often involve some imperfect hacks.) This patch resolves the cpg case, which is what the original bug was filed against. It may resolve a problem that you're defining, but it doesn't resolve the problem I'm defining. Would you like bz 506255 to represent your bug or mine? If yours, then I'll open a new bz. 
Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync] [patch] - Fix problems with long token timeout and cpg
On Wed, Jul 01, 2009 at 06:21:14PM +0200, Jan Friesse wrote: The included patch should fix https://bugzilla.redhat.com/show_bug.cgi?id=506255 . David, I hope it fixes the problem for you. It's based on the simple idea of adding a node startup timestamp at the end of cpg_join (and joinlist) calls. If the timestamp is larger than the old timestamp, we know the node was restarted and we didn't notice - deliver a leave event and then a join event. If the timestamp is the same (or in special cases lower) - a new cpg app joined - send only a join event. Of course, the patch isn't so simple. Cpg_join messages are always sent as larger messages with a timestamp (btw. the timestamp is a 64-bit value, because I expect a l(o^64)ng life for corosync ;) ). On delivery, we test if the message is larger than a standard message. If it is - we have a ts - use it. A bigger problem was the joinlist, because it's an array, ... you will see in the source. The solution is to send a special entry with pid 0 (no process should ever have pid 0) and the timestamp encoded in the name (ugly, but it looks like it works). Please comment if you can. This isn't specifically a cpg bug/problem, it's a problem with corosync/openais in general. When a node joins the cluster before others have recognized it failed, the other nodes should immediately recognize it has previously failed and process a complete failure for it. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync] [patch] - Fix problems with long token timeout and cpg
On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote: other nodes should immediately recognize it has previously failed and process a complete failure for it. i.e. the full equivalent to what apps (using any api's) would see if the node had failed via normal token timeout. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] change startup notice to Corosync Cluster Engine
On Mon, Jun 22, 2009 at 10:48:18PM -0700, Steven Dake wrote: While you're there, perhaps knock down the level of those messages so we don't see it all in /var/log/messages every time? Jun 22 14:58:12 bull-01 corosync[2343]: [MAIN ] Corosync Executive Service RELEASE 'trunk' Jun 22 14:58:12 bull-01 corosync[2343]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Jun 22 14:58:12 bull-01 corosync[2343]: [MAIN ] Copyright (C) 2006-2008 Red Hat, Inc. Jun 22 14:58:12 bull-01 corosync[2343]: [MAIN ] Corosync Executive Service: started and ready to provide service. Everyone else seems to get by with a single I started line. ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] call for roadmap features for future releases
On Mon, Jun 22, 2009 at 09:26:06AM -0700, Steven Dake wrote: On Mon, 2009-06-22 at 10:59 -0500, David Teigland wrote: On Sat, Jun 20, 2009 at 11:51:40AM -0700, Steven Dake wrote: I invite all of our contributors to help define the X.Y roadmap of both corosync and openais. Please submit your ideas on this list. Some examples of suggested ideas have been things like converting to libtool. Also new service engine ideas are highly welcome. Keep ideas within a 1 month - 3 year timeframe. I intend to publish the roadmap with the release of Corosync and OpenAIS 1.0.0. Please submit your ideas by June 26, 2009 (friday). More apis/tools for querying/reporting internal state. So external (as in not part of the corosync binary) diagnostic tools? Yes, like corosync-cfgtool, corosync-objctrl. ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] cpgx stuck
On Wed, Jun 03, 2009 at 04:28:27PM -0500, David Teigland wrote: Running cpgx -d1 on four nodes, where -d1 causes the test to periodically kill and restart corosync. When this kill/restart happens on one node, others are typically exiting/joining the cpg during at the same time. The result is that cpgx stops receiving any cpg callbacks, and it just sits there forever. More specifically, it appears that any cpg join gets stuck if the join occurs during the failure/recovery period of another node that was killed. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync] [patch] - ckpt solution - Change of Makefile.am
On Wed, May 27, 2009 at 04:15:52PM +0200, Jan Friesse wrote: Hi, included is a patch for corosync's Makefile.am, so coroipcc.o is no longer included in lib... directly; instead the *.so is a dependency, so ipc_hdb is no longer in multiple *.so files and multiple times in the binary, which causes problems. Should solve https://bugzilla.redhat.com/show_bug.cgi?id=499918. David, can you please confirm that this solved your problem? Thanks. Yes it does, thanks, Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
[Openais] new test prog
I wrote a new program cpgx to test the virtual synchrony guarantees of corosync and cpg, http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=summary It joins a cpg, then randomly sends messages, leaves or exits, and repeats. This all creates a random sequence of messages and configuration changes (events). Everyone keeps a history of all events, and continually compares their history against everyone else. This event history is the replicated state of the program, upon which all future state is based, and which needs to be synced to a node when it joins (state transfer). If any node sees a different event sequence or content from another (violating VS), it should be quickly detected and easy to see exactly what was wrong. It's simple to run, just start cpgx on up to 8 nodes running corosync, one instance per node; nodes must have nodeid's between 1 and 255. If there's a problem it will stop running with an ERROR message. It only tries to prove VS behavior, but it incidentally tests other aspects of corosync also, e.g. it quickly reproduces this recent regression: https://lists.linux-foundation.org/pipermail/openais/2009-May/012138.html With the non-default -d1 option it will include approximated node failures in the random mix of events by periodically killing corosync and restarting it with cman_tool. (I may later use iptables to simulate more realistic node failures.) It's not default because it often causes corosync to hang; apparently one of those incidental other bugs. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync trunk] fix confchg races in cpg
On Thu, May 21, 2009 at 07:36:28AM -0700, Steven Dake wrote: It is possible with 3+ nodes joining or leaving at same time for a configuration change to be delivered to the user which it is not meant for. This patch solves that problem. ack, using this patch I can't reproduce the problem ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [corosync + openais] [patch] Dispatch return bad handle - proposed solution
On Tue, May 19, 2009 at 03:40:53PM +0200, Jan Friesse wrote: Hi, attached is a proposed solution for the *dispatch* functions which return CS_ERR_BAD_HANDLE (AIS_ERR_BAD_HANDLE (9)). David, can you please test them and report the results? Thanks, I tried the corosync patch, and cpg_dispatch error 9 is gone. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [PATCH] fix delayed shutdown
On Mon, May 18, 2009 at 01:44:50PM +0100, Chrissie Caulfield wrote: Steven Dake wrote: I don't think this will be backwards compatible with whitetank. IMO use the memb_join_message_send function as outlined. If you can show it works with whitetank then looks good for commit. OK, here's a new patch that doesn't create a new message type. The reason I had that in before was due to another bug I hadn't spotted :S I've tested this against whitetank and it works fine. Thanks, this works nicely. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] saCkptSectionIterationNext() error
On Thu, May 07, 2009 at 12:46:33AM -0700, Steven Dake wrote: On Wed, 2009-05-06 at 16:26 -0500, David Teigland wrote: I think we may have lost something in transit between irc/email/svn, Mar 26 16:10:20 dct confchg, node1 create ckpt, node2 open ckpt, node2 read ckpt - fail Mar 26 16:10:46 dct nodeid 1 creates the ckpt Mar 26 16:13:42 dct saCkptCheckpointOpen() works, saCkptSectionIterationInitialize() works, then saCkptSectionIterationNext() fails Mar 26 16:30:34 sdake wow iteration fails straight up single node Mar 26 16:30:39 sdake that was working like 1 week ago or less Mar 26 16:52:30 sdake dct found problem Mar 26 16:52:32 sdake patch coming to list now This looks like the patch, but I don't see it in svn https://lists.linux-foundation.org/pipermail/openais/2009-March/011048.html And I'm still getting error 9 (BAD_HANDLE) from saCkptSectionIterationNext(). Dave That fix is in openais trunk and handle iteration works for me. Are you using openais-0.96.tar.gz? svn trunk, [openais/trunk/services]% svn info ckpt.c Path: ckpt.c Name: ckpt.c URL: svn+ssh://svn.fedorahosted.org/svn/openais/trunk/services/ckpt.c Repository Root: svn+ssh://svn.fedorahosted.org/svn/openais Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f Revision: 1888 Node Kind: file Schedule: normal Last Changed Author: sdake Last Changed Rev: 1862 Last Changed Date: 2009-04-25 20:48:50 -0500 (Sat, 25 Apr 2009) Text Last Updated: 2009-04-27 14:21:47 -0500 (Mon, 27 Apr 2009) Checksum: 674843c3c135e651655eed1beab88a1b ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
[Openais] cpg_dispatch BAD_HANDLE
I recently started getting BAD_HANDLE errors from cpg_dispatch() when leaving a cpg: - cpg_leave() - cpg_dispatch(handle, CPG_DISPATCH_ALL) - dispatch executes a confchg for the leave - dispatch returns 9 It doesn't break anything, but I'd like to avoid adding code to detect when I should or shouldn't ignore BAD_HANDLE errors. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] detecting cpg joiners
On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote: On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote: On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote: 0. configure token timeout to some long time that is longer than all the following steps take 1. cluster members are nodeid's: 1,2,3,4 2. cpg foo has the following members: nodeid 1, pid 10 nodeid 2, pid 20 nodeid 3, pid 30 nodeid 4, pid 40 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40 (optionally reboot this node now) 4. nodeid 4: ifup eth0, start corosync 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg showing that 4:40 is not a member 6. nodeid 4: start process pid 41 that joins cpg foo 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg showing that 4:41 is a member (Steps 6 and 7 should work the same even if the process started in step 6 has pid 40 instead of pid 41.) 100% agree that is how it should work. If it doesn't, we will fix it. The only thing that may be strange is if pid in step 6 is the same pid as 40. Are you certain the test case which fails has a differing pid at step 6? If you fix step 5, then I suspect steps 6,7 will just work. After the test failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that the pid in step 6 was different (I didn't reboot the node). It's not clear what the plan was for this, any recent related changes I should try? Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] detecting cpg joiners
On Wed, May 06, 2009 at 02:10:27PM -0700, Steven Dake wrote: On Wed, 2009-05-06 at 15:04 -0500, David Teigland wrote: On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote: On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote: On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote: 0. configure token timeout to some long time that is longer than all the following steps take 1. cluster members are nodeid's: 1,2,3,4 2. cpg foo has the following members: nodeid 1, pid 10 nodeid 2, pid 20 nodeid 3, pid 30 nodeid 4, pid 40 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40 (optionally reboot this node now) 4. nodeid 4: ifup eth0, start corosync 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg showing that 4:40 is not a member 6. nodeid 4: start process pid 41 that joins cpg foo 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg showing that 4:41 is a member (Steps 6 and 7 should work the same even if the process started in step 6 has pid 40 instead of pid 41.) 100% agree that is how it should work. If it doesn't, we will fix it. The only thing that may be strange is if pid in step 6 is the same pid as 40. Are you certain the test case which fails has a differing pid at step 6? If you fix step 5, then I suspect steps 6,7 will just work. After the test failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that the pid in step 6 was different (I didn't reboot the node). It's not clear what the plan was for this, any recent related changes I should try? Dave I haven't tried corosync with this test case, but it should work now. Did you try latest corosync on this case? If it still fails Jan can address before 1.0. Just tried it, and I get the same behavior as before. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
[Openais] saCkptSectionIterationNext() error
I think we may have lost something in transit between irc/email/svn, Mar 26 16:10:20 dct confchg, node1 create ckpt, node2 open ckpt, node2 read ckpt - fail Mar 26 16:10:46 dct nodeid 1 creates the ckpt Mar 26 16:13:42 dct saCkptCheckpointOpen() works, saCkptSectionIterationInitialize() works, then saCkptSectionIterationNext() fails Mar 26 16:30:34 sdake wow iteration fails straight up single node Mar 26 16:30:39 sdake that was working like 1 week ago or less Mar 26 16:52:30 sdake dct found problem Mar 26 16:52:32 sdake patch coming to list now This looks like the patch, but I don't see it in svn https://lists.linux-foundation.org/pipermail/openais/2009-March/011048.html And I'm still getting error 9 (BAD_HANDLE) from saCkptSectionIterationNext(). Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] Partition Recovery and CPG
On Thu, Apr 16, 2009 at 12:29:27PM -0500, David Teigland wrote: VS guarantees that all cpg members will see the same sequence of messages and configuration changes, i.e. history of events. If a cpg is partitioned, that immediately violates VS. One part must be killed so that the remaining nodes will all agree on one version of history, thus maintaining VS. Partitioning can't be avoided, so an application must be able to deal with it and kill/stop one part (assuming the app depends on VS.) corosync might make this easier by not merging cpg's (or even whole clusters) that have been partitioned, but that raises other questions and I've been told that doing it would be next to impossible. I've done more reading, and it's become clear why corosync works the way it does, and shouldn't really be blamed. Corosync implements the totem protocol which is Extended Virtual Synchrony. I never knew the difference between Virtual Synchrony and Extended Virtual Synchrony. It turns out that this partitioning/remerging behavior is exactly what makes EVS different from VS. EVS/totem assumes that an app wants to continue running after being partitioned, so it extends some message ordering guarantees among nodes in both partitions. Messages sent just before the partition may be delivered in both partitions, and the idea behind EVS is that these messages will be delivered in the same order in the separate partitions (even though other messages, and confchg's of course, will be different). To get this message ordering between partitions without violating other rules, EVS adds a second transitional configuration change. So an app sees two configuration changes, the first removing nodes, the second potentially adding nodes. It seems that most apps want the standard VS behavior, and there's some doubt about whether the EVS behaviors would really be wanted or needed in real applications. 
So, our apps are left doing some extra work to reduce the EVS behavior to something closer to traditional VS. Dave ___ Openais mailing list Openais@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/openais
Re: [Openais] [Corosync] Patch - Decouple shutdown ordering from objdb position
On Wed, Apr 29, 2009 at 02:28:05PM +0200, Andrew Beekhof wrote: At the moment, startup and shutdown ordering is controlled by the plugin's position in an objdb list. This is particularly problematic for cluster resource managers which must be unloaded/stopped first. The reason for this is that they (or the resources they control) need access to at least some of the other services provided by Corosync. Based on input from Steve, this patch resolves the shutdown side of the equation and if it's acceptable I'll work on the startup side of things. I wonder if the recent cfg shutdown api from chrissie would be relevant to this? It also brings up the question of whether corosync should have a program to start/stop the corosync daemon? corosync_tool join to set a config method and start corosync; corosync_tool leave to stop corosync if it's not being used by apps (would use the cfg api). Dave
Re: [Openais] [PATCH] corosync/trunk: add logging backward compatibility config layer
On Tue, Apr 21, 2009 at 07:43:04PM +0200, Fabio M. Di Nitto wrote: On Tue, 2009-04-21 at 08:51 -0500, Ryan O'Hara wrote: On Tue, Apr 21, 2009 at 06:06:25AM +0200, Fabio M. Di Nitto wrote: Hi guys, in order to match the new logging config spec, 2 logging config keywords had to be changed. Can you direct me to this spec? Check the cluster-devel and openais mailing lists. The logging directives have been discussed to death a few tons of times. I don't have a URL to the archive handy. The cluster.conf man page shows at least my own version of what we're aiming for: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=blob;f=config/man/cluster.conf.5;h=d9e50d4e6b4569b78a6b4007f94da6f20001fa46;hb=refs/heads/STABLE3
Re: [Openais] howto distribute data across all nodes?
On Fri, Apr 17, 2009 at 10:56:47PM -0700, Steven Dake wrote: On Sat, 2009-04-18 at 07:49 +0200, Dietmar Maurer wrote: like a 'merge' function? Seems the algorithm for checkpoint recovery always uses the state from the node with the lowest processor id? Yes that is right. So if I have the following cluster: Part1: node2 node3 node4 Part2: node1 Let's assume Part1 is running for some time and has gathered some state in checkpoints. Part2 is just the newly started node1. So when node1 starts up the whole cluster uses the empty checkpoint from node1? (I guess I am confused somehow). The checkpoint service will merge checkpoints from both partitions into one view because both node1 and node2 send out their checkpoint state on a merge operation. That doesn't make any sense; I can't believe that's how it works. The resulting content would be complete nonsense. Dave
Re: [Openais] howto distribute data across all nodes?
On Sat, Apr 18, 2009 at 07:49:12AM +0200, Dietmar Maurer wrote: like a 'merge' function? Seems the algorithm for checkpoint recovery always uses the state from the node with the lowest processor id? Yes that is right. So if I have the following cluster: Part1: node2 node3 node4 Part2: node1 Let's assume Part1 is running for some time and has gathered some state in checkpoints. Part2 is just the newly started node1. So when node1 starts up the whole cluster uses the empty checkpoint from node1? (I guess I am confused somehow). It is *not* as simple as the node with the low nodeid. It is the node with the low nodeid where the state exists. When selecting the node to send state to others, you obviously need to select among nodes that have the state :-) In the dlm_controld example I mentioned earlier, the function called set_plock_ckpt_node() picks the node that will save state in the ckpt:

	list_for_each_entry(memb, &cg->members, list) {
		if (!(memb->start_flags & DLM_MFLG_HAVEPLOCK))
			continue;
		if (!low || memb->nodeid < low)
			low = memb->nodeid;
	}

Only nodes that have state will have the DLM_MFLG_HAVEPLOCK flag set; new nodes just added by a confchg will not have that flag set. Dave
Re: [Openais] Partition Recovery and CPG
On Sat, Apr 18, 2009 at 09:37:26AM +0200, Dietmar Maurer wrote: Yes, forcing the losers to reset and start from scratch is a must, but we end up doing that a layer above corosync. That means the losers often reappear again through corosync/cpg prior to being forced out. Are you talking about an implementation bug, or a 'babbling idiot' which simply joins/leaves many times? It may make sense to allow cpg partitions and merges for apps that do not require the VS guarantees from cpg. It does not make sense for apps (like mine) that rely on VS. The cause of transient partitions/merges during normal operation is largely unknown. Dave
Re: [Openais] howto distribute data accross all nodes?
On Sat, Apr 18, 2009 at 03:55:57AM -0700, Steven Dake wrote: On Sat, 2009-04-18 at 12:47 +0200, Dietmar Maurer wrote: At least the SA Forum does not mention such strange behavior. Isn't that a serious bug? Yes, I'd consider it a serious bug. Consider 2 partitions with one checkpoint: Part1: CkptSections ABC Part2: CkptSections BCD After the merge, you have: CkptSections ABCD And even worse, a section contains data from different partitions (old data mixed with new)? And there is no notification that such a thing happens? That ckpt behavior is nonsensical for most real applications, I'd wager. I'm going to have to go check whether my apps are protected from that. The SA Forum doesn't consider at all how to handle partitions in a network, or at least not very suitably (it's up to the designer of SA Forum services). They assume that applications will be using the AMF, and rely on the AMF functionality to reboot partitioned nodes (fencing) so this condition doesn't occur. They don't consider it presumably because *it doesn't make any sense*. The SA Forum services were not designed with partitioned networks in mind. It is unfortunate, but it is what it is. If an app needs true consistency without some form of fencing, the app designer has to take partitions into consideration when designing their applications. This is why I recommend using CPG for these types of environments, because it provides better design control over exactly how data remerges. If the SAF services don't specify what should happen when clusters with divergent state are combined, then it probably means it should not happen, and you should probably not allow the unspecified behavior instead of making something up. Dave
Re: [Openais] Partition Recovery and CPG
On Thu, Apr 16, 2009 at 12:38:19PM +0200, Dietmar Maurer wrote: Let's assume the cluster is partitioned: Part1: node1 node2 node3 Part2: node4 node5 After recovery, what join/leave messages do I receive with a CPG: A.) JOIN: node4 node5 or B.) JOIN: node1 node2 node3 or anything else? In practice I believe you'll see:

nodes 1-3 get a confchg with members=1,2,3,4,5 joined=4,5
nodes 4-5 get a confchg with members=1,2,3,4,5 joined=1,2,3

The issue of partitioning and merging has been a big issue over the years, and is a very serious problem for any application requiring the properties of virtual synchrony. VS guarantees that all cpg members will see the same sequence of messages and configuration changes, i.e. history of events. If a cpg is partitioned, that immediately violates VS. One part must be killed so that the remaining nodes will all agree on one version of history, thus maintaining VS. Partitioning can't be avoided, so an application must be able to deal with it and kill/stop one part (assuming the app depends on VS.) Once a partition exists, a merge back together doesn't change the fact that the disagreement has already occurred (at partition time), and that disagreement can only be resolved (to maintain VS) by killing nodes that don't agree with one version of the history. My applications use quorum to block activity in minority partitions. They also exchange messages to detect merges of prior partitions, and then kill/block nodes that *were* in a minority partition to maintain VS in the majority. (Note that a *single* node:process joining the cpg doesn't mean that it wasn't partitioned by itself and is now merging.) corosync might make this easier by not merging cpg's (or even whole clusters) that have been partitioned, but that raises other questions, and I've been told that doing it would be next to impossible.
We have a lot of experience with these situations because of corosync's tendency to form spurious, transient partitions where a partition is created and then immediately merged again in fractions of a second. This doesn't happen much any more with small clusters, but it does when you get up toward 32 nodes. This is the most significant item on the list of suggested improvements I recently sent out. Dave
[Openais] delayed shutdown
If I run 'cman_tool leave' on four nodes in parallel, node1 will leave right away, but the other three nodes don't leave until the token timeout expires for node1 causing a confchg for it, after which the other three all leave right away. This has only been annoying me recently, so I think it must have been some recent change. Dave
Re: [Openais] howto distribute data across all nodes?
On Tue, Apr 14, 2009 at 02:05:10PM +0200, Dietmar Maurer wrote: So CPG provides a framework to implement distributed finite state machines (DFSM). But there is no standard way to get the initial state of the DFSM. Almost all applications need to get the initial state, so I wonder if it would make sense to provide a service which solves that problem (at least as an example). My current solution is: I introduce a CPG mode, which is either:

DFSM_MODE_SYNC ... CPG is syncing state; only state synchronization messages allowed, other messages are delayed/queued.
DFSM_MODE_WORK ... state is synced across all members - normal operation; queued messages are delivered when we reach this state.

When a new node joins, CPG immediately changes mode to DFSM_MODE_SYNC. Then all members send their state. When a node has received the states of all members, it computes the new state by merging all received states (dfsm_state_merge_fn), and finally switches mode to DFSM_MODE_WORK. Does that make sense? Yes. I'm not sure if a generic service for this would be used much or not... maybe. Dave
[Openais] improvements and optimizations
From one lone, biased, user's point of view, optimized malloc and memcpy are uninteresting -- message throughput isn't what I'm looking for. Are there others out there who see this as important? I *would* be interested in seeing improvements in the following areas:

. message latency, if that's even possible

. recovery speed, this seems to be getting worse, things often hang for up to many seconds when nodes join or leave these days

. stability with much shorter token timeouts, we currently use 10 seconds as default, and I know corosync should work well with something much shorter, it just needs testing/validation along with some diagnostic methods to figure out when you're using something too short

. stability with clusters up to 32 nodes, with diagnostic capabilities to immediately pinpoint the cause of a breakdown

Dave
Re: [Openais] improvements and optimizations
On Tue, Apr 14, 2009 at 01:18:14PM -0700, Steven Dake wrote: . message latency, if that's even possible Reducing the time a token is held reduces latency, so the memcpy and malloc specials do reduce latency. I don't have measures of how much, however. That would be interesting to measure along with throughput; it's much more relevant for applications doing coordination or locking via messages. . recovery speed, this seems to be getting worse, things often hang for up to many seconds when nodes join or leave these days With trunk you see 2 second lags? I agree that the recovery engine needs work to allow background synchronization of data sets without blocking the entire cluster operation during the synchronization period. I'll try to measure it; my impression has been that it can be much longer than 2 seconds, but I've not been paying close attention. . stability with much shorter token timeouts, we currently use 10 seconds as default, and I know corosync should work well with something much shorter, it just needs testing/validation along with some diagnostic methods to figure out when you're using something too short Yes, with 16 nodes it should work with 100msec timeouts as long as the kernel doesn't take long locks. I think all those bugs are fixed in dlm/gfs now, however. The 10 seconds in cluster 2 was to work around those essentially kernel lockups. There were a couple of spots in the kernel that were quickly fixed by calling schedule; they were never a big problem. IIRC the timeout was increased to 10 seconds because certain drivers or nics were doing resets which would stall network i/o. We were worried that that would be common in user environments, but I doubt it. Perhaps we can get some QE effort around identifying a shorter default. We need to choose something that works in our testing first. I think we should go ahead and change it to something like 2 seconds by default, and see what happens. . stability with clusters up to 32 nodes, with diagnostic capabilities to immediately pinpoint the cause of a breakdown This is a great idea, but the diags to pinpoint the cause are very difficult. I don't have a clear picture of how they would be designed, but we have kicked around some ideas. Yes, this will require some careful analysis, both within corosync and the networking layers... and extended access to 16-32 nodes. Dave
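For reference, the token timeout being discussed is the one set in the totem section of the corosync/openais configuration, in milliseconds. A hedged sketch of the 2-second value proposed above (the exact file layout and defaults vary by version):

```
totem {
	version: 2
	# token timeout in milliseconds; the discussion above proposes
	# dropping the old 10000 (10s) default to something like 2000
	token: 2000
}
```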
Re: [Openais] detecting cpg joiners
On Thu, Apr 09, 2009 at 06:02:38PM -0700, Steven Dake wrote: The issue that Dave is talking about I believe is described in the following bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=489451 No, not at all. IMO you should get a leave event for any process that leaves the process group, independent of how totem works underneath. CPG should provide the guarantees you seek, and if it doesn't, it is defective. OK, good. Here's what we expect:

0. configure token timeout to some long time that is longer than all the following steps take
1. cluster members are nodeid's: 1,2,3,4
2. cpg foo has the following members: nodeid 1, pid 10; nodeid 2, pid 20; nodeid 3, pid 30; nodeid 4, pid 40
3. nodeid 4: ifdown eth0, kill corosync, kill pid 40 (optionally reboot this node now)
4. nodeid 4: ifup eth0, start corosync
5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg showing that 4:40 is not a member
6. nodeid 4: start process pid 41 that joins cpg foo
7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg showing that 4:41 is a member

(Steps 6 and 7 should work the same even if the process started in step 6 has pid 40 instead of pid 41.) Dave
Re: [Openais] detecting cpg joiners
On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote: On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:

0. configure token timeout to some long time that is longer than all the following steps take
1. cluster members are nodeid's: 1,2,3,4
2. cpg foo has the following members: nodeid 1, pid 10; nodeid 2, pid 20; nodeid 3, pid 30; nodeid 4, pid 40
3. nodeid 4: ifdown eth0, kill corosync, kill pid 40 (optionally reboot this node now)
4. nodeid 4: ifup eth0, start corosync
5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg showing that 4:40 is not a member
6. nodeid 4: start process pid 41 that joins cpg foo
7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg showing that 4:41 is a member

(Steps 6 and 7 should work the same even if the process started in step 6 has pid 40 instead of pid 41.) 100% agree that is how it should work. If it doesn't, we will fix it. The only thing that may be strange is if the pid in step 6 is the same pid as 40. Are you certain the test case which fails has a differing pid at step 6? If you fix step 5, then I suspect steps 6,7 will just work. After the test failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that the pid in step 6 was different (I didn't reboot the node). Dave
Re: [Openais] detecting cpg joiners
On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote: For added fun, a node that restarts quickly enough (think a VM) won't even appear to have left (or rejoined) the cluster. At the next totem confchg event, it will simply just be there again with no indication that anything happened. At least this is true for the raw corosync/openais membership data; perhaps CPG can infer this some other way. Cpg should not let a node go away and come back without notice. In practice I'd expect back to back confchg's: one showing it leave and another showing it join. As Chrissie mentioned earlier, cpg shouldn't show the same node both leaving and joining in a single confchg. In theory I think it would be legitimate. Consider a couple examples. m: member list, j: joined list, l: left list

1. nodes A and B join at once
   A gets confchg: m=A,B j=A,B l=
   B gets confchg: m=A,B j=A,B l=

2. node C joins
   A gets confchg: m=A,B,C j=C l=
   B gets confchg: m=A,B,C j=C l=
   C gets confchg: m=A,B,C j=C l=

3. node C leaves and quickly rejoins in a single confchg
   A gets confchg: m=A,B,C j=C l=C
   B gets confchg: m=A,B,C j=C l=C
   C gets confchg: m=A,B,C j=C l=C

4. node D joins and quickly leaves (or fails) in a single confchg
   A gets confchg: m=A,B,C j=D l=D
   B gets confchg: m=A,B,C j=D l=D
   C gets confchg: m=A,B,C j=D l=D
   D gets confchg: m=A,B,C j=D l=D ?*

* if D does a quick join+leave it may expect to see this confchg showing it in the joined list, the left list, and not in the member list.

Again, the examples in 3 and 4 are, I think, legitimate in theory. In practice it sounds like they won't occur. If a quick leave+join is guaranteed to be visible through cpg, then it must be possible to observe at the lower level from raw corosync data. Dave
Re: [Openais] howto distribute data across all nodes?
On Thu, Apr 09, 2009 at 09:00:08PM +0200, Dietmar Maurer wrote: If new, normal read/write messages to the replicated state continue while the new node is syncing the pre-existing state, the new node needs to save those operations to apply after it's synced. Ah, that probably works. But it can lead to very high memory usage if traffic is high. If that's a problem you could block normal activity during the sync period. Is somebody really using that? If so, is there some code available (for save/replay)? There is no general purpose code. dlm_controld is an example of a program doing something like this, http://git.fedorahosted.org/git/dlm.git It uses cpg to replicate state of posix locks, uses checkpoints to sync existing lock state to new nodes, and saves messages on a new node until it has completed syncing (i.e. reading pre-existing state from the checkpoint.) Dave
Re: [Openais] detecting cpg joiners
On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote: On Thu, Apr 9, 2009 at 20:49, Joel Becker joel.bec...@oracle.com wrote: On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote: For added fun, a node that restarts quickly enough (think a VM) won't even appear to have left (or rejoined) the cluster. At the next totem confchg event, it will simply just be there again with no indication that anything happened. This had BETTER not happen. It does, I've seen it enough times that Pacemaker has code to deal with it. I'd call that a serious flaw we need to get fixed. I'll see if I can make it happen here. Dave
Re: [Openais] [CRASH] corosync crash under load
On Tue, Mar 17, 2009 at 02:18:58PM +, Chrissie Caulfield wrote: I had three GFS filesystems all mounted on 13 nodes. When I went to umount them I got the following crash on 5 nodes of the system:

(gdb) bt
#0  0x7f21baeb0f05 in raise () from /lib64/libc.so.6
#1  0x7f21baeb2a73 in abort () from /lib64/libc.so.6
#2  0x7f21baef0438 in __libc_message () from /lib64/libc.so.6
#3  0x7f21baef5ec8 in malloc_printerr () from /lib64/libc.so.6
#4  0x7f21baef8486 in free () from /lib64/libc.so.6
#5  0x00dabdd2 in messages_free () at totemsrp.c:2233
#6  message_handler_orf_token (instance=0x7f21b89bd010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3296
#7  0x00da4f44 in rrp_deliver_fn (context=0x1064180, msg=0x10824ac, msg_len=70) at totemrrp.c:1332
#8  0x00da2fdf in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemnet.c:687
#9  0x00da0698 in poll_run (handle=7749363892505018368) at coropoll.c:409
#10 0x00404617 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:687
(gdb) frame 5
#5  0x00dabdd2 in messages_free () at totemsrp.c:2233
2233		free (regular_message->iovec[j].iov_base);
(gdb) p j
$1 = 1
(gdb) p regular_message->iovec[j].iov_base
$2 = (void *) 0xde6ebd8b4c81a096
(gdb)

Here's another similar one while mounting/unmounting: Program terminated with signal 11, Segmentation fault.
#0  0x7fa0ccf34159 in do_proc_join (name=0x7fffd78743a0, pid=11227, nodeid=2, reason=1) at cpg.c:740
740		if (pi->pid == pid && pi->nodeid == nodeid) {
(gdb) bt
#0  0x7fa0ccf34159 in do_proc_join (name=0x7fffd78743a0, pid=11227, nodeid=2, reason=1) at cpg.c:740
#1  0x7fa0ccf34499 in message_handler_req_exec_cpg_procjoin (message=0x7fffd7874390, nodeid=2) at cpg.c:818
#2  0x00404829 in deliver_fn (nodeid=2, iovec=0x7fffd7874560, iov_len=1, endian_conversion_required=0) at main.c:433
#3  0x0031d3616e74 in app_deliver_fn (nodeid=2, iovec=0x7fffd7874540, iov_len=1, endian_conversion_required=0) at totempg.c:456
#4  0x0031d3616b68 in totempg_deliver_fn (nodeid=2, iovec=0x82f978, iov_len=1, endian_conversion_required=0) at totempg.c:600
#5  0x0031d3615e12 in totemmrp_deliver_fn (nodeid=2, iovec=0x82f978, iov_len=1, endian_conversion_required=0) at totemmrp.c:98
#6  0x0031d3613b8f in messages_deliver_to_app (instance=0x7fa0ce7d5010, skip=0, end_point=1467) at totemsrp.c:3599
#7  0x0031d3614064 in message_handler_mcast (instance=0x7fa0ce7d5010, msg=0x836ffc, msg_len=281, endian_conversion_needed=0) at totemsrp.c:3730
#8  0x0031d3615c32 in main_deliver_fn (context=0x7fa0ce7d5010, msg=0x836ffc, msg_len=281) at totemsrp.c:4173
#9  0x0031d3608ee8 in none_mcast_recv (rrp_instance=0x81c580, iface_no=0, context=0x7fa0ce7d5010, msg=0x836ffc, msg_len=281) at totemrrp.c:495
#10 0x0031d360aa43 in rrp_deliver_fn (context=0x81ca40, msg=0x836ffc, msg_len=281) at totemrrp.c:1343
#11 0x0031d3606faf in net_deliver_fn (handle=7749363892505018368, fd=7, revents=1, data=0x836950) at totemnet.c:687
#12 0x0031d360533c in poll_run (handle=7749363892505018368) at coropoll.c:409
#13 0x00405058 in main (argc=2, argv=0x7fffd78770c8) at main.c:687
Re: [Openais] automake merged into corosync
On Tue, Mar 10, 2009 at 01:41:57AM -0700, Steven Dake wrote:

./autogen.sh
./configure
make
make install DESTDIR=/

Any chance that install could default to DESTDIR=/ ? Dave
Re: [Openais] [Cluster-devel] cluster/logging settings
On Thu, Oct 30, 2008 at 11:26:14PM -0700, Steven Dake wrote: There are two types of messages. Those intended for users/admins and those intended for developers. Both of these message types should always be recorded *somewhere*. The entire concept of LOG_LEVEL_DEBUG is dubious to me. If you want to stick with that semantic and definition that is fine, but really a LOG_LEVEL_DEBUG means this message is for the developer. These messages should be recorded and stored when a process segfaults, aborts due to assertion, or at administrative request. Since the frequency of these messages is high, there is no other option for recording them, since they must _always_ be recorded for the purposes of debugging a field failure. Recording to disk or syslog has significant performance impact. The only solution for these types of messages is to record them into a flight recorder buffer which can be dumped: 1) at segv 2) at sigabrt 3) at administrative request This is a fundamental difference in how we have approached logging debugging messages in the past, but will lead to the ability to ensure we _always_ have debug trace data available instead of telling the user/admin "Go turn on debug and hope you can reproduce that error, and btw since 10k messages are logged your disk will fill up with irrelevant debug messages and your system will perform like mud." Logging these in memory is the only solution that I see as suitable, and in all cases they should be filtered from any output source such as stderr, file, or syslog. There's a difference between high volume trace debug data stored in memory, and low volume informational debug data that can be easily written to a file. Both kinds of data can be useful. My programs are simple enough that low volume informational debug data is enough for me to identify and fix a problem. So, low volume informational data is all I produce. It can be useful to write this data to a file.
Your program is complex enough that high volume trace debug data is usually needed for you to identify and fix a problem. So, high volume trace data is all you produce. This is too much data to write to a file (by the running program). So, we're using DEBUG to refer to different things. We need to define two different levels (just for clarity in this discussion):

. DEBUGLO is low volume informational data like I use
. DEBUGHI is high volume trace data like you use

DEBUGHI messages wouldn't ever be logged to files by the program while running. DEBUGLO messages could be, though, if the user configured it. So, circling back around, how should a user configure DEBUGLO messages to appear in syslog or a logfile? In particular, what would they enter in the cluster.conf logging/ section? My suggestion is: syslog_level=foo logfile_level=bar where foo and bar are one of the standard priority names in syslog.h. So, if a user wanted DEBUGLO messages to appear in daemon.log, they'd set logging/daemon/logfile_level=debug and if they wanted DEBUGLO messages to appear in /var/log/messages, logging/daemon/syslog_level=debug (Note that debug means DEBUGLO here because DEBUGHI messages are only saved in memory, not to files by a running program.) There's another separate question I have about corosync, and that's whether you could identify some limited number of messages that would be appropriate for DEBUGLO? They would be used by non-experts to do some rough debugging of problems, and by experts to narrow down a problem before digging into the high volume trace data. I'd suggest that a good starting point for DEBUGLO would be the data that openais has historically put in /var/log/messages. Data that helps you quickly triage a problem (or verify that things are happening correctly) without stepping through all the trace data.
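As cluster.conf markup, the proposal above might look something like the following. This is a hedged sketch only: the syslog_level/logfile_level attribute names come from this message, but the element layout (a per-daemon child element matching the logging/daemon/... path) is my guess, not a settled schema:

```xml
<logging syslog_level="info" logfile_level="info">
  <!-- hypothetical per-daemon override, mirroring the
       logging/daemon/logfile_level=debug path above -->
  <daemon name="fenced" logfile_level="debug"/>
</logging>
```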
Re: [Openais] [Cluster-devel] cluster/logging settings
On Tue, Nov 04, 2008 at 02:58:47PM -0600, David Teigland wrote: the cluster.conf logging/ section? My suggestion is: syslog_level=foo logfile_level=bar FWIW, I'm not set on this if someone has a better suggestion. I just want something unambiguous. debug=on has been shown to mean something different to everyone. Dave
Re: [Openais] [RFC] simple blackbox
Wow that is a complicated solution. I thought that simple and blackbox went well together. Completely agree, too complex. The logging code I copy into all the daemons I write is at the opposite end of the spectrum; I doubt it's possible to be much simpler. (I copy it everywhere because it's too short and simple to bother with a lib.)

#define DUMP_SIZE (1024 * 1024)

extern char dump_buf[DUMP_SIZE];
extern int dump_point;
extern int dump_wrap;
extern char daemon_debug_buf[256];

void daemon_dump_save(void)
{
	int len, i;

	len = strlen(daemon_debug_buf);

	for (i = 0; i < len; i++) {
		dump_buf[dump_point++] = daemon_debug_buf[i];
		if (dump_point == DUMP_SIZE) {
			dump_point = 0;
			dump_wrap = 1;
		}
	}
}

#define log_debug(fmt, args...) \
do { \
	snprintf(daemon_debug_buf, 255, "%ld " fmt "\n", time(NULL), ##args); \
	daemon_dump_save(); \
} while (0)

That's it, just over 20 lines. I also have a function that will write dump_buf over a unix socket so a command line program can see it while the daemon is running (that's the only way I ever use it, actually). This is non-threaded, of course, and corosync will need something more complex, but the point is you can keep it simple. Dave
Re: [Openais] Split brain when using EVS library
On Tue, Sep 09, 2008 at 12:27:34PM +0200, Arne Eriksson R wrote: Hi, We have a cluster with 6 processors using openais stable version 0.80.3. For some reason our cluster splits up into two rings. Scenario is: node1(n1) n2 n3 n4 n5 n6 are in the ring. Suddenly the ring splits into two rings: n1 n2 n3 got leave msg from n4 n5 n6; n4 n5 n6 got leave msg from n1 n2 n3. After a few milliseconds the two rings join again: n1 n2 n3 got join msg from n4 n5 n6; n4 n5 n6 got join msg from n1 n2 n3. The two rings are joined into one ring again: node1(n1) n2 n3 n4 n5 n6 are in the ring. We at RH have struggled a great deal with this exact feature for quite a long time. It's the biggest problem by far that we've had using openais. The question is if this is a normal scenario from EVS in the openais implementation? The problem is that the application needs to detect the difference between two kinds of joins: the normal join where the two rings/nodes join for the first time, and the abnormal join where a ring has split and re-joined (without any nodes being restarted). The first case typically requires only a sync of some nodes (bringing the history up to date). The second case requires a merger, i.e. selection of a losing side and the loser discarding the loser's history. Our applications (cman, dlm, gfs, etc using libcpg) need to make this same distinction: a join from a clean state where aisexec was just started, vs a join from a dirty state where the cluster experienced a transient partition (i.e. nodes split into two clusters and then aisexec automatically merged the two clusters back together again.) We've had to add the ability for our applications to detect that this has happened by sending messages containing the state of the app. And it makes things quite a bit more complicated than they should be. Dave
Re: [Openais] logsys patch
On Tue, Jul 01, 2008 at 03:11:26PM -0700, Steven Dake wrote:

Dave, Your patch looks reasonable but has a few issues which need to be addressed. It doesn't address the setting of logsys_subsys_id but defines it. I want to avoid the situation where logsys_subsys_id is defined, but then not set. What I suggest here is to set logsys_subsys_id to some known value (-1) and assert if the subsystem id is still that value within log_printf, to help developers catch this scenario. At the moment the current API enforces proper behavior (it won't link if the developer does the wrong thing). With your patch it will link, but may not behave properly, sending log messages to the wrong subsystem (0) instead of the subsystem desired by the developer. This is why the macros are there (to set the subsystem id and define it). Your patch moves the definition to a generic location but doesn't address the setting of the subsystem id at all.

Good thought, done.

The logsys_exit function doesn't need to be added. Instead this is managed by logsys_atsegv and logsys_flush. If you desire, you can keep logsys_exit and have it call logsys_flush, which is the proper thing to do on exit.

OK, added flush to exit, and reset logsys_subsys_id back to -1.

Please follow the coding style guidelines (i.e. match the rest of the logsys code) so I don't have to rework your patch before commit.

OK, new patch attached.
Dave

Index: logsys.c
===================================================================
--- logsys.c	(revision 1568)
+++ logsys.c	(working copy)
@@ -632,3 +632,41 @@
 {
 	worker_thread_group_wait (log_thread_group);
 }
+
+int logsys_init (char *name, int mode, int facility, int priority, char *file)
+{
+	char *errstr;
+
+	logsys_subsys_id = 0;
+
+	strncpy (logsys_loggers[0].subsys, name,
+		sizeof (logsys_loggers[0].subsys));
+	logsys_config_mode_set (mode);
+	logsys_config_facility_set (name, facility);
+	logsys_config_file_set (&errstr, file);
+	_logsys_config_priority_set (0, priority);
+	if ((mode & LOG_MODE_BUFFER_BEFORE_CONFIG) == 0) {
+		_logsys_wthread_create ();
+	}
+	return (0);
+}
+
+int logsys_conf (char *name, int mode, int facility, int priority, char *file)
+{
+	char *errstr;
+
+	strncpy (logsys_loggers[0].subsys, name,
+		sizeof (logsys_loggers[0].subsys));
+	logsys_config_mode_set (mode);
+	logsys_config_facility_set (name, facility);
+	logsys_config_file_set (&errstr, file);
+	_logsys_config_priority_set (0, priority);
+	return (0);
+}
+
+void logsys_exit (void)
+{
+	logsys_subsys_id = -1;
+	logsys_flush ();
+}
+
Index: logsys.h
===================================================================
--- logsys.h	(revision 1568)
+++ logsys.h	(working copy)
@@ -170,8 +170,9 @@
 	}							\
 }
 
+static unsigned int logsys_subsys_id __attribute__((unused)) = -1;	\
+
 #define LOGSYS_DECLARE_NOSUBSYS(priority)			\
-static unsigned int logsys_subsys_id __attribute__((unused));	\
 __attribute__ ((constructor)) static void logsys_nosubsys_init (void)	\
 {								\
 	_logsys_nosubsys_set();					\
@@ -180,7 +181,6 @@
 }
 
 #define LOGSYS_DECLARE_SUBSYS(subsys,priority)			\
-static unsigned int logsys_subsys_id __attribute__((unused));	\
 __attribute__ ((constructor)) static void logsys_subsys_init (void)	\
 {								\
 	logsys_subsys_id =					\
@@ -188,6 +188,7 @@
 }
 
 #define log_printf(lvl, format, args...)			\
 do {								\
+	assert (logsys_subsys_id != -1);			\
 	if ((lvl) <= logsys_loggers[logsys_subsys_id].priority) { \
 		_logsys_log_printf2 (__FILE__, __LINE__, lvl,	\
 			logsys_subsys_id, (format), ##args);	\
@@ -195,6 +196,7 @@
 } while(0)
 
 #define dprintf(format, args...) do {				\
+	assert (logsys_subsys_id != -1);			\
 	if (LOG_LEVEL_DEBUG <= logsys_loggers[logsys_subsys_id].priority) { \
 		_logsys_log_printf2 (__FILE__, __LINE__, LOG_DEBUG,	\
 			logsys_subsys_id, (format), ##args);	\
@@ -202,6 +204,7 @@
 } while(0)
 
 #define ENTER_VOID() do {					\
+	assert (logsys_subsys_id != -1);			\
 	if (LOG_LEVEL_DEBUG <= logsys_loggers[logsys_subsys_id].priority) { \
 		_logsys_trace (__FILE__, __LINE__, LOGSYS_TAG_ENTER,	\
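The sentinel-plus-constructor scheme Steve asked for can be shown in miniature. The sketch below is a stand-alone illustration of the pattern (the names are placeholders, not logsys symbols): the id starts at the sentinel -1, a GCC constructor registers the subsystem before main() runs, and any logging call can then assert that registration actually happened.

```c
#include <assert.h>

/* Sentinel: -1 means "no subsystem registered yet".  In logsys the
   variable is defined once per translation unit by the header; here
   it is a single global for brevity. */
static int subsys_id = -1;

/* Stand-in for LOGSYS_DECLARE_SUBSYS: a constructor function runs
   before main(), setting the id to a real value. */
__attribute__ ((constructor)) static void subsys_init (void)
{
	subsys_id = 0;
}

/* A logging macro would assert (subsys_id != -1) on every call;
   this helper just exposes the check. */
int subsys_ready (void)
{
	return subsys_id != -1;
}
```

If a developer defines the id without using the declaring macro, the constructor never runs, the sentinel survives, and the first log call asserts instead of silently logging to subsystem 0.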
Re: [Openais] new config system
On Wed, Mar 26, 2008 at 11:57:59AM -0400, Lon Hohberger wrote:

On Wed, 2008-03-26 at 10:32 -0500, David Teigland wrote:

[1] Just to be clear, the meta-configuration idea is where a variety of config files can be used to populate a central config-file-agnostic repository. A single interface is used by all to read config data from the repository. Even if we did this, I don't see that it would give us anything. All our existing applications access data that's only specified in a single config file anyway, so interchangeable back-end files would be an unused feature.

True, it doesn't give _us_ much to be agnostic to what the config file format looks like. However, with different back-ends used to populate the single config repo at run-time, we then have the ability to not have config files at all (well, except the meta-config stuff). What I mean is: an administrator might like to store the cluster configuration in an inventory database which isn't local to the cluster itself (e.g. LDAP, or whatever). This might not be a requirement now, but that was one of the points of having multiple config back-ends, IIRC.

That's what I had in mind with the "other?" arrow pointing up at libcmanconf. Multiple back-ends for libcmanconf is one thing (good, simple); multiple back-ends for a meta-configuration database with a meta-API is what I've become skeptical about.
Re: [Openais] new config system
On Wed, Mar 26, 2008 at 10:32:54AM -0500, David Teigland wrote:

A while back I drew this diagram to show what we were aiming to design, in broad terms, for the next generation aisexec/cman config system: http://people.redhat.com/teigland/cman3.jpg

I think perhaps that diagram attempts to do too much, and I've drawn another: http://people.redhat.com/teigland/cman3b.jpg

The big problem I see with the first diagram is that it tries to use objdb to solve the meta-configuration problem [1]. That's a hard problem; I'm not sure objdb is the right place to solve it, I don't think we have enough information to solve it properly right now, and I don't see that we have a pressing need to solve it right now. So the second diagram steps back to what Fabio has already implemented, more or less.

There were quite a few things wrong in the cman3b diagram, so based on the explanation from Chrissie and Fabio, here's another: http://people.redhat.com/teigland/cman3c.jpg

(The "assumes" comments don't mean it would be impossible to use one lib with a different config plugin, but that it wouldn't make sense to do so in practice.)

Lon pointed out another problem with the first diagram: we want to be able to read config values without openais running (or without it running properly). That's one of the things we were trying to get away from with ccsd. The cman3c diagram does not solve this problem, but it could, by caching a local copy of the config data to use when aisexec is not running.
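The local-cache fallback could be little more than a read path that prefers the live source and drops back to the last cached copy when aisexec is down. A hypothetical sketch — none of these names are real cman/openais APIs, and the back-ends are passed in as function pointers to keep the idea self-contained:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A back-end read function returns 0 on success, nonzero on failure
   (e.g. aisexec not running). */
typedef int (*read_fn)(const char *key, char *val, size_t len);

/* Hypothetical lookup: prefer the live configuration source; if it
   is unavailable, fall back to the locally cached copy. */
int config_get(const char *key, char *val, size_t len,
	       read_fn live, read_fn cached)
{
	if (live && live(key, val, len) == 0)
		return 0;	/* a real version would also refresh the cache here */
	if (cached && cached(key, val, len) == 0)
		return 0;
	return -1;
}

/* Example stubs standing in for the real back-ends. */
static int live_down(const char *key, char *val, size_t len)
{
	(void)key; (void)val; (void)len;
	return -1;		/* aisexec not running */
}

static int cache_read(const char *key, char *val, size_t len)
{
	if (strcmp(key, "expected_votes") == 0) {
		strncpy(val, "2", len);
		return 0;
	}
	return -1;
}
```

The cache would be written on every successful live read, so tools run while aisexec is down see the last known-good configuration, which is the property ccsd was trying (awkwardly) to provide.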