Hi,

I ran into a problem with whitetank.

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1c:23:00:5a:8a brd ff:ff:ff:ff:ff:ff
    inet 147.2.207.210/24 brd 147.2.207.255 scope global eth0
    inet6 fe80::21c:23ff:fe00:5a8a/64 scope link
       valid_lft forever preferred_lft forever

Note: There are multiple IPs on lo.

If I leave all the timeout options in the totem directive at their defaults, everything works fine. But if I set "consensus" to a value greater than "downcheck", for example:

totem {
	version: 2
	secauth: off
	threads: 0
	consensus: 2500
	interface {
		ringnumber: 0
		bindnetaddr: 147.2.207.0
		mcastaddr: 226.94.1.1
		mcastport: 5495
	}
}

logging {
	to_stderr: yes
	to_file: yes
	logfile: /tmp/ais
	debug: on
	timestamp: on
}

amf {
	mode: disabled
}

and then delete the listening IP:

# ip addr del 147.2.207.210/24 brd 147.2.207.255 dev eth0

aisexec crashes with a segmentation fault:

#0  0xb7e4a200 in strcpy () from /lib/libc.so.6
#1  0xb74dbdf7 in my_cluster_node_load () at clm.c:280
#2  0xb74dc879 in clm_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
    member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084,
    left_list_entries=0, joined_list=0x0, joined_list_entries=0,
    ring_id=0xb7529664) at clm.c:538
#3  0x08063d71 in confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
    member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084,
    left_list_entries=0, joined_list=0x0, joined_list_entries=0,
    ring_id=0xb7529664) at main.c:213
#4  0x0805df4e in app_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
    member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084,
    left_list_entries=0, joined_list=0x0, joined_list_entries=0,
    ring_id=0xb7529664) at totempg.c:327
#5  0x0805de6a in totempg_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
    member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084,
    left_list_entries=0, joined_list=0x0, joined_list_entries=0,
    ring_id=0xb7529664) at totempg.c:480
#6  0x0805d964 in totemmrp_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL,
    member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084,
    left_list_entries=0, joined_list=0x0, joined_list_entries=0,
    ring_id=0xb7529664) at totemmrp.c:92
#7  0x08056b30 in memb_state_operational_enter (instance=0xb7508008) at totemsrp.c:1635
#8  0x0805aeeb in message_handler_orf_token (instance=0xb7508008, msg=0x8193574,
    msg_len=70, endian_conversion_needed=0) at totemsrp.c:3402
#9  0x0805d757 in main_deliver_fn (context=0xb7508008, msg=0x8193574, msg_len=70)
    at totemsrp.c:4131
#10 0x08051ac6 in none_token_recv (rrp_instance=0x8192ae0, iface_no=0,
    context=0xb7508008, msg=0x8193574, msg_len=70, token_seq=3) at totemrrp.c:506
#11 0x080533e6 in rrp_deliver_fn (context=0x8185d18, msg=0x8193574, msg_len=70)
    at totemrrp.c:1308
#12 0x0804fa68 in net_deliver_fn (handle=0, fd=6, revents=1, data=0x8192f48)
    at totemnet.c:695
#13 0x0804de7b in poll_run (handle=0) at aispoll.c:402
#14 0x08064db0 in main (argc=2, argv=0xbf81bb24) at main.c:623

The same problem occurs when I set an alias IP on eth0 whose network address differs from "bindnetaddr".

So I patched clm.c to trace the problem:

--- clm.c.orig	2009-01-26 05:44:55.000000000 +0800
+++ clm.c	2009-08-31 18:28:10.000000000 +0800
@@ -268,13 +268,21 @@
 	unsigned int iface_count;
 	char **status;
 	const char *iface_string;
+	int my_nodeid;
+	int res;
 
-	totempg_ifaces_get (
-		totempg_my_nodeid_get (),
+	my_nodeid = totempg_my_nodeid_get ();
+	res = totempg_ifaces_get (
+		my_nodeid,
 		interfaces,
 		&status,
 		&iface_count);
 
+	if (res != 0) {
+		log_printf (LOG_LEVEL_ERROR, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+		assert (0) ;
+	}
+
 	iface_string = totemip_print (&interfaces[0]);
 
 	sprintf ((char *)my_cluster_node.node_address.value, "%s",

The output was as follows:

Aug 31 18:29:08.802934 [MAIN ] AIS Executive Service RELEASE 'subrev 1152 version 0.80'
Aug 31 18:29:08.803180 [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Aug 31 18:29:08.803232 [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Aug 31 18:29:08.803274 [MAIN ] AIS Executive Service: started and ready to provide service.
Aug 31 18:29:08.803315 [print.c:0361] log setup
Aug 31 18:29:08.823250 [TOTEM] Token Timeout (1000 ms) retransmit timeout (238 ms)
Aug 31 18:29:08.823342 [TOTEM] token hold (180 ms) retransmits before loss (4 retrans)
Aug 31 18:29:08.823366 [TOTEM] join (50 ms) send_join (0 ms) consensus (2500 ms) merge (200 ms)
Aug 31 18:29:08.823386 [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
Aug 31 18:29:08.823404 [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500
Aug 31 18:29:08.823423 [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages)
Aug 31 18:29:08.823441 [TOTEM] send threads (0 threads)
Aug 31 18:29:08.823457 [TOTEM] RRP token expired timeout (238 ms)
Aug 31 18:29:08.823474 [TOTEM] RRP token problem counter (2000 ms)
Aug 31 18:29:08.823491 [TOTEM] RRP threshold (10 problem count)
Aug 31 18:29:08.823507 [TOTEM] RRP mode set to none.
Aug 31 18:29:08.823524 [TOTEM] heartbeat_failures_allowed (0)
Aug 31 18:29:08.823540 [TOTEM] max_network_delay (50 ms)
Aug 31 18:29:08.823594 [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Aug 31 18:29:08.824018 [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Aug 31 18:29:08.824049 [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug 31 18:29:08.824439 [TOTEM] The network interface [147.2.207.210] is now up.
Aug 31 18:29:08.824503 [TOTEM] Created or loaded sequence id 42284.147.2.207.210 for this ring.
Aug 31 18:29:08.824652 [TOTEM] entering GATHER state from 15.
Aug 31 18:29:08.826367 [SERV ] Service initialized 'openais extended virtual synchrony service'
Aug 31 18:29:08.827577 [SERV ] Service initialized 'openais cluster membership service B.01.01'
Aug 31 18:29:08.827903 [SERV ] Service initialized 'openais availability management framework B.01.01'
Aug 31 18:29:08.828319 [SERV ] Service initialized 'openais checkpoint service B.01.01'
Aug 31 18:29:08.829046 [SERV ] Service initialized 'openais event service B.01.01'
Aug 31 18:29:08.829744 [SERV ] Service initialized 'openais distributed locking service B.01.01'
Aug 31 18:29:08.830559 [SERV ] Service initialized 'openais message service B.01.01'
Aug 31 18:29:08.830832 [SERV ] Service initialized 'openais configuration service'
Aug 31 18:29:08.831267 [SERV ] Service initialized 'openais cluster closed process group service v1.01'
Aug 31 18:29:08.831540 [SERV ] Service initialized 'openais cluster config database access v1.01'
Aug 31 18:29:08.831574 [SYNC ] Not using a virtual synchrony filter.
Aug 31 18:29:08.831684 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:08.831726 [TOTEM] Saving state aru 0 high seq received 0
Aug 31 18:29:08.831773 [TOTEM] Storing new sequence id for ring a530
Aug 31 18:29:08.831903 [TOTEM] entering COMMIT state.
Aug 31 18:29:08.831946 [TOTEM] entering RECOVERY state.
Aug 31 18:29:08.832026 [TOTEM] position [0] member 147.2.207.210:
Aug 31 18:29:08.832050 [TOTEM] previous ring seq 42284 rep 147.2.207.210
Aug 31 18:29:08.832069 [TOTEM] aru 0 high delivered 0 received flag 1
Aug 31 18:29:08.832087 [TOTEM] Did not need to originate any messages in recovery.
Aug 31 18:29:08.832127 [TOTEM] Sending initial ORF token
Aug 31 18:29:08.832363 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832389 [CLM ] New Configuration:
Aug 31 18:29:08.832406 [CLM ] Members Left:
Aug 31 18:29:08.832423 [CLM ] Members Joined:
Aug 31 18:29:08.832489 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832511 [CLM ] New Configuration:
Aug 31 18:29:08.832535 [CLM ] r(0) ip(147.2.207.210)
Aug 31 18:29:08.832554 [CLM ] Members Left:
Aug 31 18:29:08.832570 [CLM ] Members Joined:
Aug 31 18:29:08.832591 [CLM ] r(0) ip(147.2.207.210)
Aug 31 18:29:08.832623 [SYNC ] This node is within the primary component and will provide service.
Aug 31 18:29:08.832662 [TOTEM] entering OPERATIONAL state.
Aug 31 18:29:08.834725 [CLM ] got nodejoin message 147.2.207.210
Aug 31 18:29:12.890629 [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Aug 31 18:29:12.890714 [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug 31 18:29:12.891060 [TOTEM] The network interface is down.
Aug 31 18:29:12.891180 [TOTEM] entering GATHER state from 15.
aisexec: clm.c:283: my_cluster_node_load: Assertion `0' failed.
Aug 31 18:29:15.405300 [TOTEM] entering GATHER state from 0.
Aug 31 18:29:15.405381 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:15.405410 [TOTEM] Saving state aru c high seq received c
Aug 31 18:29:15.405452 [TOTEM] Storing new sequence id for ring a534
Aug 31 18:29:15.405574 [TOTEM] entering COMMIT state.
Aug 31 18:29:15.405614 [TOTEM] entering RECOVERY state.
Aug 31 18:29:15.405684 [TOTEM] position [0] member 127.0.0.1:
Aug 31 18:29:15.405708 [TOTEM] previous ring seq 42288 rep 147.2.207.210
Aug 31 18:29:15.405726 [TOTEM] aru c high delivered c received flag 1
Aug 31 18:29:15.405744 [TOTEM] Did not need to originate any messages in recovery.
Aug 31 18:29:15.405782 [TOTEM] Sending initial ORF token
Aug 31 18:29:15.406020 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:15.406045 [CLM ] New Configuration:
Aug 31 18:29:15.406071 [CLM ] r(0) ip(127.0.0.1)
Aug 31 18:29:15.406089 [CLM ] Members Left:
Aug 31 18:29:15.406104 [CLM ] Members Joined:
Aug 31 18:29:15.406124 [CLM ] Cannot get interfaces for my_nodeid: 200007f

So boundto.nodeid became "127.0.0.2", but that address could not be found in my_memb_list.

Eventually I found the cause in totemip.c, in totemip_iface_check(). In that function, whether or not an appropriate interface is found, the "boundto" argument is always updated at the end:

totemip.c:587:	totemip_copy (boundto, &ipaddr);

I applied the patch below:

--- openais.orig/exec/totemip.c	2009-08-31 19:30:43.000000000 +0800
+++ openais/exec/totemip.c	2009-08-31 19:03:23.000000000 +0800
@@ -497,7 +497,7 @@
 
 		h = (struct nlmsghdr *)rcvbuf;
 		if (h->nlmsg_type == NLMSG_DONE)
-			break;
+			return -1;
 
 		if (h->nlmsg_type == NLMSG_ERROR) {
 			close(fd);
--- openais.org/exec/clm.c	2009-01-26 05:44:55.000000000 +0800
+++ openais/exec/clm.c	2009-08-31 20:57:18.000000000 +0800
@@ -268,13 +268,21 @@
 	unsigned int iface_count;
 	char **status;
 	const char *iface_string;
+	int my_nodeid;
+	int res;
 
-	totempg_ifaces_get (
-		totempg_my_nodeid_get (),
+	my_nodeid = totempg_my_nodeid_get ();
+	res = totempg_ifaces_get (
+		my_nodeid,
 		interfaces,
 		&status,
 		&iface_count);
 
+	if (res != 0) {
+		log_printf (LOG_LEVEL_DEBUG, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+		return ;
+	}
+
 	iface_string = totemip_print (&interfaces[0]);
 
 	sprintf ((char *)my_cluster_node.node_address.value, "%s",

It seems to resolve the problem. Or is there any other consideration I have missed?

I haven't tried corosync + openais 1.0, so I have no idea whether it has the same issue.

Thanks,
  Yan

-- 
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™
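
P.S. To make the totemip_iface_check() behaviour described above easier to see, here is a minimal standalone sketch in C. It is not the openais code: struct addr, the ifaces table, in_network(), addr_set() and both iface_check_*() functions are made-up names for illustration only, and the 0 / -1 return values are my assumption. It only contrasts writing the output argument unconditionally with writing it on success only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real address type; holds dotted IPv4 text. */
struct addr {
	char text[16];
};

/* Hypothetical interface list: what is left once eth0 has lost its address. */
static const char *const ifaces[] = { "127.0.0.1", "127.0.0.2" };
static const size_t iface_count = sizeof(ifaces) / sizeof(ifaces[0]);

/* Nonzero when ip starts with the given network prefix, e.g. "147.2.207.". */
static int in_network(const char *ip, const char *prefix)
{
	return strncmp(ip, prefix, strlen(prefix)) == 0;
}

/* Copy dotted text into dst, always NUL-terminated. */
static void addr_set(struct addr *dst, const char *ip)
{
	strncpy(dst->text, ip, sizeof(dst->text) - 1);
	dst->text[sizeof(dst->text) - 1] = '\0';
}

/* Writes boundto at the end even when nothing matched. */
int iface_check_unconditional(const char *prefix, struct addr *boundto)
{
	const char *last = "";
	int found = 0;
	size_t i;

	assert(boundto != NULL);
	for (i = 0; i < iface_count; i++) {
		last = ifaces[i];
		if (in_network(last, prefix)) {
			found = 1;
			break;
		}
	}
	addr_set(boundto, last);	/* runs on the failure path too */
	return found ? 0 : -1;
}

/* Writes boundto only on a match; on failure the caller's value is kept. */
int iface_check_guarded(const char *prefix, struct addr *boundto)
{
	size_t i;

	assert(boundto != NULL);
	for (i = 0; i < iface_count; i++) {
		if (in_network(ifaces[i], prefix)) {
			addr_set(boundto, ifaces[i]);
			return 0;
		}
	}
	return -1;
}
```

With only the two loopback addresses left and a prefix of "147.2.207.", the first variant leaves "127.0.0.2" in boundto, while the second keeps whatever the caller had there before and just reports the failure.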