Hi,

On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov wrote:
> 01.12.2010 16:32, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
> >> Hi Steven, hi all.
> >>
> >> I often see this assert on one of nodes after I stop corosync on some
> >> another node in newly-setup 4-node cluster.
> >
> > Does the assert happen on a node lost event? Or once new
> > partition is formed?
>
> I first noticed it when I rebooted another node, just after console said
> that OpenAIS is stopped.
>
> Can't say right now, what exactly event did it follow, I'm actually
> fighting with several problems with corosync, pacemaker, NFS4 and
> phantom uncorrectable ECC errors simultaneously and I'm a bit lost with
> all of them.
>
> >
> >> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> >> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6
> >> #2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> >> #3 0x00007f5196176406 in memb_consensus_agreed
> >> (instance=0x7f5196554010) at totemsrp.c:1194
> >> #4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> >> memb_join=0x262f628) at totemsrp.c:3918
> >> #5 0x00007f519617b619 in message_handler_memb_join
> >> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> >> optimized out>, endian_conversion_needed=<value optimized out>)
> >> at totemsrp.c:4161
> >> #6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> >> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> >> msg_len=<value optimized out>) at totemrrp.c:720
> >> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> >> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> >> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> >> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> >> at totemudp.c:1244
> >> #9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> >> coropoll.c:510
> >> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> >> optimized out>, envp=<value optimized out>) at main.c:1680
> >>
> >> Last fplay lines are:
> >>
> >> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> >> pending delivery queue
> >> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> >> pending delivery queue
> >> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> >> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> >> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> >> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> >> rec=[36130] Log Message=releasing messages up to and including 1367
> >> rec=[36131] Log Message=FAILED TO RECEIVE
> >> rec=[36132] Log Message=entering GATHER state from 6.
> >> rec=[36133] Log Message=entering GATHER state from 0.
> >> Finishing replay: records found [33993]
> >>
> >> What could be the reason for this? Bug, switches, memory errors?
> >
> > The assertion fails because corosync finds out that
> > instance->my_proc_list and instance->my_failed_list are
> > equal. That happens immediately after the "FAILED TO RECEIVE"
> > message which is issued when fail_recv_const token rotations
> > happened without any multicast packet received (defaults to 50).
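To make the quoted explanation concrete, here is a minimal sketch of why equal process and failed lists leave nothing behind after a set subtraction. It is not corosync code: `node_id`, `set_subtract` and `remaining_members` are hypothetical names, and plain integers stand in for the real node addresses.

```c
#include <stddef.h>

/* Hypothetical stand-in for a node address (not the corosync type). */
typedef int node_id;

/* Copy into out[] every entry of full[] that does not appear in
 * subtract[], and return how many entries were kept. */
static size_t set_subtract(node_id *out,
                           const node_id *full, size_t full_n,
                           const node_id *subtract, size_t subtract_n)
{
    size_t kept = 0;

    for (size_t i = 0; i < full_n; i++) {
        int found = 0;

        for (size_t j = 0; j < subtract_n; j++) {
            if (full[i] == subtract[j]) {
                found = 1;
                break;
            }
        }
        if (!found) {
            out[kept++] = full[i];
        }
    }
    return kept;
}

/* Number of members left once the failed nodes are removed from the
 * process list. Assumes proc_n <= 16 to keep the sketch short. */
static size_t remaining_members(const node_id *proc, size_t proc_n,
                                const node_id *failed, size_t failed_n)
{
    node_id scratch[16];

    return set_subtract(scratch, proc, proc_n, failed, failed_n);
}
```

With a four-node process list, a failed list holding the same four nodes leaves zero members, so any check that requires at least one remaining member cannot pass. A failed list holding a single node leaves three.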
I took a look at the code and the protocol specification again,
and it seems that the assert has not been valid since Steve
patched the part dealing with the "FAILED TO RECEIVE" condition.
The patch is from 2010-06-03, posted to the list here:

http://marc.info/?l=openais&m=127559807608484&w=2

The last hunk of the patch contains this code (exec/totemsrp.c):

3933         if (memb_consensus_agreed (instance) && instance->failed_to_recv == 1) {
3934                 instance->failed_to_recv = 0;
3935                 srp_addr_copy (&instance->my_proc_list[0],
3936                         &instance->my_id);
3937                 instance->my_proc_list_entries = 1;
3938                 instance->my_failed_list_entries = 0;
3939
3940                 memb_state_commit_token_create (instance);
3941
3942                 memb_state_commit_enter (instance);
3943                 return;
3944         }

This code never gets a chance to run, because on failed_to_recv
the two sets (my_proc_list and my_failed_list) are equal, so
subtracting one from the other leaves no entries and the assert
in memb_consensus_agreed() fails:

1185         memb_set_subtract (token_memb, &token_memb_entries,
1186                 instance->my_proc_list, instance->my_proc_list_entries,
1187                 instance->my_failed_list, instance->my_failed_list_entries);
...
1195         assert (token_memb_entries >= 1);

In other words, with A being the failed-to-receive state in which
the two sets are equal, it's something like this:

        if A:
                if memb_consensus_agreed() and failed_to_recv:
                        form a single node ring and try to recover

        memb_consensus_agreed():
                assert(!A)

Steve, can you take a look and confirm that this holds?

Cheers,

Dejan
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais