Hi,

On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov wrote:
> 01.12.2010 16:32, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
> >> Hi Steven, hi all.
> >>
> >> I often see this assert on one of nodes after I stop corosync on some
> >> another node in newly-setup 4-node cluster.
> > 
> > Does the assert happen on a node lost event? Or once new
> > partition is formed?
> 
> I first noticed it when I rebooted another node, just after console said
> that OpenAIS is stopped.
> 
> Can't say right now, what exactly event did it follow, I'm actually
> fighting with several problems with corosync, pacemaker, NFS4 and
> phantom uncorrectable ECC errors simultaneously and I'm a bit lost with
> all of them.
> 
> > 
> >> #0  0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> >> #1  0x00007f51953e6185 in abort () from /lib64/libc.so.6
> >> #2  0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> >> #3  0x00007f5196176406 in memb_consensus_agreed
> >> (instance=0x7f5196554010) at totemsrp.c:1194
> >> #4  0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> >> memb_join=0x262f628) at totemsrp.c:3918
> >> #5  0x00007f519617b619 in message_handler_memb_join
> >> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> >> optimized out>, endian_conversion_needed=<value optimized out>)
> >>     at totemsrp.c:4161
> >> #6  0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> >> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> >> msg_len=<value optimized out>) at totemrrp.c:720
> >> #7  0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> >> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> >> #8  0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> >> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> >> at totemudp.c:1244
> >> #9  0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> >> coropoll.c:510
> >> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> >> optimized out>, envp=<value optimized out>) at main.c:1680
> >>
> >> Last fplay lines are:
> >>
> >> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> >> pending delivery queue
> >> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> >> pending delivery queue
> >> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> >> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> >> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> >> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> >> rec=[36130] Log Message=releasing messages up to and including 1367
> >> rec=[36131] Log Message=FAILED TO RECEIVE
> >> rec=[36132] Log Message=entering GATHER state from 6.
> >> rec=[36133] Log Message=entering GATHER state from 0.
> >> Finishing replay: records found [33993]
> >>
> >> What could be the reason for this? Bug, switches, memory errors?
> > 
> > The assertion fails because corosync finds out that
> > instance->my_proc_list and instance->my_failed_list are
> > equal. That happens immediately after the "FAILED TO RECEIVE"
> > message which is issued when fail_recv_const token rotations
> > happened without any multicast packet received (defaults to 50).

I took another look at the code and the protocol specification,
and it seems that the assert is no longer valid since Steve patched
the part dealing with the "FAILED TO RECEIVE" condition. The
patch, dated 2010-06-03, was posted to the list here:
http://marc.info/?l=openais&m=127559807608484&w=2

The last hunk of the patch contains this code (exec/totemsrp.c):

3933         if (memb_consensus_agreed (instance) && instance->failed_to_recv == 1) {
3934                 instance->failed_to_recv = 0;
3935                 srp_addr_copy (&instance->my_proc_list[0],
3936                     &instance->my_id);
3937                 instance->my_proc_list_entries = 1;
3938                 instance->my_failed_list_entries = 0;
3939            
3940                 memb_state_commit_token_create (instance);
3941            
3942                 memb_state_commit_enter (instance);
3943                 return;
3944         }

This code never gets a chance to run because, on failed_to_recv,
the two sets (my_proc_list and my_failed_list) are equal, which
makes the assert fail in memb_consensus_agreed():

1185     memb_set_subtract (token_memb, &token_memb_entries,
1186         instance->my_proc_list, instance->my_proc_list_entries,
1187         instance->my_failed_list, instance->my_failed_list_entries);
...
1195     assert (token_memb_entries >= 1);

In other words, with A meaning "my_proc_list and my_failed_list
are equal", it's something like this:

        if A:
                if memb_consensus_agreed() and failed_to_recv:
                        form a single node ring and try to recover

        memb_consensus_agreed():
                assert(!A)
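
To make the set arithmetic concrete, here is a minimal standalone
sketch. It is not corosync code: node_id, set_subtract() and
token_memb_count() are hypothetical stand-ins for struct srp_addr,
memb_set_subtract() and the computation that precedes the assert,
and nodes are modelled as plain integers.

```c
#include <assert.h>

/* Hypothetical stand-in for struct srp_addr: a node is just an integer id. */
typedef int node_id;

/*
 * Simplified model of set subtraction: copies into out[] every entry
 * of full[] that does not appear in sub[], and returns how many
 * entries were copied. out[] must have room for full_n entries.
 */
int set_subtract(node_id *out,
                 const node_id *full, int full_n,
                 const node_id *sub, int sub_n)
{
    int out_n = 0;

    assert(full_n >= 0 && sub_n >= 0);

    for (int i = 0; i < full_n; i++) {
        int found = 0;

        for (int j = 0; j < sub_n; j++) {
            if (full[i] == sub[j]) {
                found = 1;
                break;
            }
        }
        if (!found) {
            out[out_n++] = full[i];
        }
    }
    return out_n;
}

/*
 * Models the value that token_memb_entries would hold just before
 * the assert: the size of (proc list minus failed list).
 * Assumes proc_n <= 16, which is enough for this illustration.
 */
int token_memb_count(const node_id *proc, int proc_n,
                     const node_id *failed, int failed_n)
{
    node_id token_memb[16];

    assert(proc_n <= 16);
    return set_subtract(token_memb, proc, proc_n, failed, failed_n);
}
```

With four nodes in both lists, token_memb_count() returns 0, so a
check of the form "entries >= 1" cannot hold. With only one of the
four nodes in the failed list it returns 3.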

Steve, can you take a look and confirm that this holds?

Cheers,

Dejan
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
