[Openais] [whitetank] segfault when an interface has multiple IPs

Yan Gao Mon, 31 Aug 2009 07:09:56 -0700

Hi,
I ran into a problem with whitetank.

#ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1c:23:00:5a:8a brd ff:ff:ff:ff:ff:ff
    inet 147.2.207.210/24 brd 147.2.207.255 scope global eth0
    inet6 fe80::21c:23ff:fe00:5a8a/64 scope link
       valid_lft forever preferred_lft forever


Note: there are multiple IPs on lo.

If I leave all the timeout options in the totem directive at their defaults,
everything works fine.

However, if I set "consensus" to a value greater than "downcheck", for example:

totem {
        version: 2
        secauth: off
        threads: 0
        consensus: 2500
        interface {
                ringnumber: 0
                bindnetaddr: 147.2.207.0
                mcastaddr: 226.94.1.1
                mcastport: 5495
        }
}

logging {
        to_stderr: yes
        to_file: yes
        logfile: /tmp/ais
        debug: on
        timestamp: on
}

amf {
        mode: disabled
}


And once I delete the listening IP:
# ip addr del 147.2.207.210/24 brd 147.2.207.255 dev eth0

aisexec then crashes with a segmentation fault:
#0  0xb7e4a200 in strcpy () from /lib/libc.so.6
#1  0xb74dbdf7 in my_cluster_node_load () at clm.c:280
#2  0xb74dc879 in clm_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084, left_list_entries=0, joined_list=0x0, joined_list_entries=0, ring_id=0xb7529664) at clm.c:538
#3  0x08063d71 in confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084, left_list_entries=0, joined_list=0x0, joined_list_entries=0, ring_id=0xb7529664) at main.c:213
#4  0x0805df4e in app_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084, left_list_entries=0, joined_list=0x0, joined_list_entries=0, ring_id=0xb7529664) at totempg.c:327
#5  0x0805de6a in totempg_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084, left_list_entries=0, joined_list=0x0, joined_list_entries=0, ring_id=0xb7529664) at totempg.c:480
#6  0x0805d964 in totemmrp_confchg_fn (configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84, member_list_entries=1, left_list=0xbf815084, left_list_entries=0, joined_list=0x0, joined_list_entries=0, ring_id=0xb7529664) at totemmrp.c:92
#7  0x08056b30 in memb_state_operational_enter (instance=0xb7508008) at totemsrp.c:1635
#8  0x0805aeeb in message_handler_orf_token (instance=0xb7508008, msg=0x8193574, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3402
#9  0x0805d757 in main_deliver_fn (context=0xb7508008, msg=0x8193574, msg_len=70) at totemsrp.c:4131
#10 0x08051ac6 in none_token_recv (rrp_instance=0x8192ae0, iface_no=0, context=0xb7508008, msg=0x8193574, msg_len=70, token_seq=3) at totemrrp.c:506
#11 0x080533e6 in rrp_deliver_fn (context=0x8185d18, msg=0x8193574, msg_len=70) at totemrrp.c:1308
#12 0x0804fa68 in net_deliver_fn (handle=0, fd=6, revents=1, data=0x8192f48) at totemnet.c:695
#13 0x0804de7b in poll_run (handle=0) at aispoll.c:402
#14 0x08064db0 in main (argc=2, argv=0xbf81bb24) at main.c:623


The same problem occurs when an alias IP is set on eth0 with a network
address different from "bindnetaddr".

So I patched clm.c to trace the problem:

--- clm.c.orig  2009-01-26 05:44:55.000000000 +0800
+++ clm.c       2009-08-31 18:28:10.000000000 +0800
@@ -268,13 +268,21 @@
        unsigned int iface_count;
        char **status;
        const char *iface_string;
+       int my_nodeid;
+       int res;

-       totempg_ifaces_get (
-               totempg_my_nodeid_get (),
+       my_nodeid = totempg_my_nodeid_get ();
+       res = totempg_ifaces_get (
+               my_nodeid,
                interfaces,
                &status,
                &iface_count);

+       if (res != 0) {
+               log_printf (LOG_LEVEL_ERROR, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+               assert (0) ;
+       }
+
        iface_string = totemip_print (&interfaces[0]);

        sprintf ((char *)my_cluster_node.node_address.value, "%s",


The output was as follows:

Aug 31 18:29:08.802934 [MAIN ] AIS Executive Service RELEASE 'subrev 1152 
version 0.80'
Aug 31 18:29:08.803180 [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc 
and contributors.
Aug 31 18:29:08.803232 [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Aug 31 18:29:08.803274 [MAIN ] AIS Executive Service: started and ready to 
provide service.
Aug 31 18:29:08.803315 [print.c:0361] log setup
Aug 31 18:29:08.823250 [TOTEM] Token Timeout (1000 ms) retransmit timeout (238 
ms)
Aug 31 18:29:08.823342 [TOTEM] token hold (180 ms) retransmits before loss (4 
retrans)
Aug 31 18:29:08.823366 [TOTEM] join (50 ms) send_join (0 ms) consensus (2500 
ms) merge (200 ms)
Aug 31 18:29:08.823386 [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
Aug 31 18:29:08.823404 [TOTEM] seqno unchanged const (30 rotations) Maximum 
network MTU 1500
Aug 31 18:29:08.823423 [TOTEM] window size per rotation (50 messages) maximum 
messages per rotation (17 messages)
Aug 31 18:29:08.823441 [TOTEM] send threads (0 threads)
Aug 31 18:29:08.823457 [TOTEM] RRP token expired timeout (238 ms)
Aug 31 18:29:08.823474 [TOTEM] RRP token problem counter (2000 ms)
Aug 31 18:29:08.823491 [TOTEM] RRP threshold (10 problem count)
Aug 31 18:29:08.823507 [TOTEM] RRP mode set to none.
Aug 31 18:29:08.823524 [TOTEM] heartbeat_failures_allowed (0)
Aug 31 18:29:08.823540 [TOTEM] max_network_delay (50 ms)
Aug 31 18:29:08.823594 [TOTEM] HeartBeat is Disabled. To enable set 
heartbeat_failures_allowed > 0
Aug 31 18:29:08.824018 [TOTEM] Receive multicast socket recv buffer size 
(262142 bytes).
Aug 31 18:29:08.824049 [TOTEM] Transmit multicast socket send buffer size 
(262142 bytes).
Aug 31 18:29:08.824439 [TOTEM] The network interface [147.2.207.210] is now up.
Aug 31 18:29:08.824503 [TOTEM] Created or loaded sequence id 
42284.147.2.207.210 for this ring.
Aug 31 18:29:08.824652 [TOTEM] entering GATHER state from 15.
Aug 31 18:29:08.826367 [SERV ] Service initialized 'openais extended virtual 
synchrony service'
Aug 31 18:29:08.827577 [SERV ] Service initialized 'openais cluster membership 
service B.01.01'
Aug 31 18:29:08.827903 [SERV ] Service initialized 'openais availability 
management framework B.01.01'
Aug 31 18:29:08.828319 [SERV ] Service initialized 'openais checkpoint service 
B.01.01'
Aug 31 18:29:08.829046 [SERV ] Service initialized 'openais event service 
B.01.01'
Aug 31 18:29:08.829744 [SERV ] Service initialized 'openais distributed locking 
service B.01.01'
Aug 31 18:29:08.830559 [SERV ] Service initialized 'openais message service 
B.01.01'
Aug 31 18:29:08.830832 [SERV ] Service initialized 'openais configuration 
service'
Aug 31 18:29:08.831267 [SERV ] Service initialized 'openais cluster closed 
process group service v1.01'
Aug 31 18:29:08.831540 [SERV ] Service initialized 'openais cluster config 
database access v1.01'
Aug 31 18:29:08.831574 [SYNC ] Not using a virtual synchrony filter.
Aug 31 18:29:08.831684 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:08.831726 [TOTEM] Saving state aru 0 high seq received 0
Aug 31 18:29:08.831773 [TOTEM] Storing new sequence id for ring a530
Aug 31 18:29:08.831903 [TOTEM] entering COMMIT state.
Aug 31 18:29:08.831946 [TOTEM] entering RECOVERY state.
Aug 31 18:29:08.832026 [TOTEM] position [0] member 147.2.207.210:
Aug 31 18:29:08.832050 [TOTEM] previous ring seq 42284 rep 147.2.207.210
Aug 31 18:29:08.832069 [TOTEM] aru 0 high delivered 0 received flag 1
Aug 31 18:29:08.832087 [TOTEM] Did not need to originate any messages in 
recovery.
Aug 31 18:29:08.832127 [TOTEM] Sending initial ORF token
Aug 31 18:29:08.832363 [CLM  ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832389 [CLM  ] New Configuration:
Aug 31 18:29:08.832406 [CLM  ] Members Left:
Aug 31 18:29:08.832423 [CLM  ] Members Joined:
Aug 31 18:29:08.832489 [CLM  ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832511 [CLM  ] New Configuration:
Aug 31 18:29:08.832535 [CLM  ]  r(0) ip(147.2.207.210)
Aug 31 18:29:08.832554 [CLM  ] Members Left:
Aug 31 18:29:08.832570 [CLM  ] Members Joined:
Aug 31 18:29:08.832591 [CLM  ]  r(0) ip(147.2.207.210)
Aug 31 18:29:08.832623 [SYNC ] This node is within the primary component and 
will provide service.
Aug 31 18:29:08.832662 [TOTEM] entering OPERATIONAL state.
Aug 31 18:29:08.834725 [CLM  ] got nodejoin message 147.2.207.210
Aug 31 18:29:12.890629 [TOTEM] Receive multicast socket recv buffer size 
(262142 bytes).
Aug 31 18:29:12.890714 [TOTEM] Transmit multicast socket send buffer size 
(262142 bytes).
Aug 31 18:29:12.891060 [TOTEM] The network interface is down.
Aug 31 18:29:12.891180 [TOTEM] entering GATHER state from 15.
aisexec: clm.c:283: my_cluster_node_load: Assertion `0' failed.
Aug 31 18:29:15.405300 [TOTEM] entering GATHER state from 0.
Aug 31 18:29:15.405381 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:15.405410 [TOTEM] Saving state aru c high seq received c
Aug 31 18:29:15.405452 [TOTEM] Storing new sequence id for ring a534
Aug 31 18:29:15.405574 [TOTEM] entering COMMIT state.
Aug 31 18:29:15.405614 [TOTEM] entering RECOVERY state.
Aug 31 18:29:15.405684 [TOTEM] position [0] member 127.0.0.1:
Aug 31 18:29:15.405708 [TOTEM] previous ring seq 42288 rep 147.2.207.210
Aug 31 18:29:15.405726 [TOTEM] aru c high delivered c received flag 1
Aug 31 18:29:15.405744 [TOTEM] Did not need to originate any messages in 
recovery.
Aug 31 18:29:15.405782 [TOTEM] Sending initial ORF token
Aug 31 18:29:15.406020 [CLM  ] CLM CONFIGURATION CHANGE
Aug 31 18:29:15.406045 [CLM  ] New Configuration:
Aug 31 18:29:15.406071 [CLM  ]  r(0) ip(127.0.0.1)
Aug 31 18:29:15.406089 [CLM  ] Members Left:
Aug 31 18:29:15.406104 [CLM  ] Members Joined:
Aug 31 18:29:15.406124 [CLM  ] Cannot get interfaces for my_nodeid: 200007f


So boundto.nodeid became "127.0.0.2", but that node could not be found in
my_memb_list.
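
The node ID 200007f in the last log line can be mapped back to an address by
hand. A minimal sketch, assuming the node ID is simply the four IPv4 address
bytes read as an integer on a little-endian host; the helper name
nodeid_to_dotted is made up for illustration and is not part of openais:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper, not part of openais: format a node ID as a dotted
 * quad, assuming the lowest byte of the integer is the first octet of the
 * IPv4 address (address bytes read as an integer on a little-endian host). */
static const char *nodeid_to_dotted(uint32_t nodeid, char *buf, size_t len)
{
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)(nodeid & 0xff),
             (unsigned)((nodeid >> 8) & 0xff),
             (unsigned)((nodeid >> 16) & 0xff),
             (unsigned)((nodeid >> 24) & 0xff));
    return buf;
}
```

Under that assumption, passing 0x200007f with a 16-byte buffer yields
"127.0.0.2", the secondary address on lo.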
Eventually I found the cause in totemip.c, in totemip_iface_check().
In that function, the "boundto" argument is always updated at the end,
whether or not an appropriate interface was found:
totemip.c:587:
totemip_copy (boundto, &ipaddr);
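
The effect of that unconditional copy can be illustrated with a simplified
sketch. This is not the netlink code from totemip.c; the struct and both
function names are hypothetical, and addresses are reduced to plain 32-bit
integers in host order. It only contrasts writing the out-parameter
unconditionally with writing it on a match:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for an interface address entry. */
struct addr { uint32_t ip; uint32_t mask; };

/* Problematic shape: the out-parameter is written whether or not a
 * matching interface was found, so a failed lookup still changes it. */
static int iface_check_unconditional(uint32_t bindnet, const struct addr *list,
                                     size_t n, uint32_t *boundto)
{
    uint32_t candidate = 0;
    int found = -1;

    for (size_t i = 0; i < n; i++) {
        candidate = list[i].ip;
        if ((list[i].ip & list[i].mask) == (bindnet & list[i].mask)) {
            found = 0;
            break;
        }
    }
    *boundto = candidate;   /* overwritten even when found == -1 */
    return found;
}

/* Safer shape: the out-parameter is only written when a match exists. */
static int iface_check_guarded(uint32_t bindnet, const struct addr *list,
                               size_t n, uint32_t *boundto)
{
    for (size_t i = 0; i < n; i++) {
        if ((list[i].ip & list[i].mask) == (bindnet & list[i].mask)) {
            *boundto = list[i].ip;
            return 0;
        }
    }
    return -1;              /* boundto left untouched on failure */
}
```

With only 127.0.0.1/8 and 127.0.0.2/8 left in the list and a bind network of
147.2.207.0, both variants report failure, but the first one leaves the
caller holding the last address examined (127.0.0.2) instead of its previous
value.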

I applied a patch as the below:

--- openais.orig/exec/totemip.c 2009-08-31 19:30:43.000000000 +0800
+++ openais/exec/totemip.c      2009-08-31 19:03:23.000000000 +0800
@@ -497,7 +497,7 @@

                h = (struct nlmsghdr *)rcvbuf;
                if (h->nlmsg_type == NLMSG_DONE)
-                       break;
+                       return -1;

                if (h->nlmsg_type == NLMSG_ERROR) {
                        close(fd);
--- openais.org/exec/clm.c      2009-01-26 05:44:55.000000000 +0800
+++ openais/exec/clm.c  2009-08-31 20:57:18.000000000 +0800
@@ -268,13 +268,21 @@
        unsigned int iface_count;
        char **status;
        const char *iface_string;
+       int my_nodeid;
+       int res;

-       totempg_ifaces_get (
-               totempg_my_nodeid_get (),
+       my_nodeid = totempg_my_nodeid_get ();
+       res = totempg_ifaces_get (
+               my_nodeid,
                interfaces,
                &status,
                &iface_count);

+       if (res != 0) {
+               log_printf (LOG_LEVEL_DEBUG, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+               return ;
+       }
+
        iface_string = totemip_print (&interfaces[0]);

        sprintf ((char *)my_cluster_node.node_address.value, "%s",


It seems to resolve the problem. Are there any other considerations I have missed?
I haven't tried corosync + openais 1.0. No idea if it has the same issue.

Thanks,
    Yan
--
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais