On Mon, Oct 05, 2009 at 03:49:42PM +0200, Jerome Flesch wrote:
> Hello,

Hi, Jerome.

> I'm still stress-testing Corosync/Openais (trunk) on FreeBSD, and I've found 
> out a tiny bug:
> 
> On peer A, my test program calls:
> - saClmInitialize()
> - saClmClusterTrack(SA_TRACK_CURRENT | SA_TRACK_CHANGES)
> - saClmFinalize()
> - (does various tests with CPG ..)
> Next, when I shut down/kill Corosync on peer B, Corosync on peer A segfaults.

Can you provide the exact test program? I'd like to see all the
details of each API call. Are you using a test program from the
openais tree, or did you write your own?

> When my test program calls saClmClusterTrackStop() before saClmFinalize(),
> Corosync on peer A doesn't crash. From that and the stack trace (attached
> below), I guess it tries to signal the cluster change to a program that is
> no longer connected (-> missing disconnection notification to CLM?). I also
> guess this means Corosync will segfault if the client itself crashes.

I'm guessing that a callback is sent to node A. If I understand
correctly, you are enabling tracking on node A, right? If CLM is
anything like the MSG service (and I think it is with respect to how
tracking works), enabling tracking will generate callbacks on
membership changes *to the node that enabled tracking*.

> By the way, it's *not* due to a BSD-ism ;) (I've also tested on a small 
> Debian cluster).
> 
> The core dump from the crashed Corosync gives me the following stack trace:

Thanks. I'll take a look. But like I said, if you can provide the test
program, that would be great.

Ryan

> -----
> (gdb) bt
> #0  0x28501120 in library_notification_send 
> (cluster_notification_entries=0x3fbd36a0, notify_count=2) at clm.c:429
> #1  0x285014ca in lib_notification_leave (nodes=0x3fbf76f0, nodes_entries=2) 
> at clm.c:524
> #2  0x2850189c in clm_confchg_fn 
> (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0x3fbf8a14, 
> member_list_entries=1, left_list=0x3fbf7e14, left_list_entries=2, 
>     joined_list=0x0, joined_list_entries=0, ring_id=0x2833765c) at clm.c:584
> #3  0x0804ba7b in confchg_fn 
> (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0x3fbf8a14, 
> member_list_entries=1, left_list=0x3fbf7e14, left_list_entries=2, 
> joined_list=0x0, 
>     joined_list_entries=0, ring_id=0x2833765c) at main.c:324
> #4  0x280a2a2f in app_confchg_fn 
> (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0x3fbf8a14, 
> member_list_entries=1, left_list=0x3fbf7e14, left_list_entries=2, 
>     joined_list=0x0, joined_list_entries=0, ring_id=0x2833765c) at 
> totempg.c:350
> #5  0x280a2932 in totempg_confchg_fn 
> (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0x3fbf8a14, 
> member_list_entries=1, left_list=0x3fbf7e14, left_list_entries=2, 
>     joined_list=0x0, joined_list_entries=0, ring_id=0x2833765c) at 
> totempg.c:524
> #6  0x280a2343 in totemmrp_confchg_fn 
> (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0x3fbf8a14, 
> member_list_entries=1, left_list=0x3fbf7e14, left_list_entries=2, 
>     joined_list=0x0, joined_list_entries=0, ring_id=0x2833765c) at 
> totemmrp.c:109
> #7  0x2809ad3f in memb_state_operational_enter (instance=0x28316000) at 
> totemsrp.c:1678
> #8  0x2809f890 in message_handler_orf_token (instance=0x28316000, 
> msg=0x28381638, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3484
> #9  0x280a20bb in main_deliver_fn (context=0x28316000, msg=0x28381638, 
> msg_len=70) at totemsrp.c:4212
> #10 0x28095bb2 in none_token_recv (rrp_instance=0x282fe400, iface_no=0, 
> context=0x28316000, msg=0x28381638, msg_len=70, token_seq=3) at totemrrp.c:536
> #11 0x28097849 in rrp_deliver_fn (context=0x28206190, msg=0x28381638, 
> msg_len=70) at totemrrp.c:1393
> #12 0x28093cf5 in net_deliver_fn (handle=7749363892505018368, fd=7, 
> revents=1, data=0x28381000) at totemudp.c:1223
> #13 0x28091d44 in poll_run (handle=7749363892505018368) at coropoll.c:394
> #14 0x0804d432 in main (argc=2, argv=0x3fbfece4) at main.c:1069
> (gdb) print *cluster_notification_entries 
> $1 = {cluster_node = {node_id = 2, node_address = {length = 11, family = 
> MAR_CLM_AF_INET, value = "172.16.10.2", '\0' <repeats 52 times>}, node_name = 
> {length = 11, 
>       value = "172.16.10.2", '\0' <repeats 244 times>}, member = 0, 
> boot_timestamp = 1254723307000000000, initial_view_number = 617}, 
> cluster_change = MAR_NODE_LEFT}
> (gdb) print clm_pd
> $2 = (struct clm_pd *) 0xc3fbbe18
> (gdb) print *clm_pd
> Cannot access memory at address 0xc3fbbe18
> (gdb) info threads 
> * 3 Thread 0x28201040 (LWP 100188)  0x28501120 in library_notification_send 
> (cluster_notification_entries=0x3fbd36a0, notify_count=2) at clm.c:429
>   2 Thread 0x28201150 (LWP 100284)  0x2815fe3f in poll () at poll.S:2
>   1 Thread 0x282019d0 (LWP 100286)  0x2811f0fb in semop () at semop.S:2
> -----
> 
> Hope it helps.
> 
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
