It wasn't. I still haven't fully tracked the issue down, but it was caused by another node in the cluster. Node B, which I had just started, was trying to send traffic to node A, and node A was in a weird state. Node B would not start successfully until corosync on node A was restarted.
I've had this happen a few times now in the last few days. The ability for one node to cause a start failure on another node is a significant problem.

-Patrick

------------------------------------------------------------------------
*From: *Jan Friesse <[email protected]>
*Sent: * 2013-10-10 03:58:31 E
*To: *Patrick Hemmer <[email protected]>, Steven Dake <[email protected]>
*CC: *[email protected]
*Subject: *Re: [corosync] Issue starting the CMAP service

> Patrick,
> I'm sure it's really a firewall/switch problem. Please make sure that port
> and port - 1 are not blocked. For testing purposes, you can just
> disable firewall completely and see if corosync works or not.
>
> Regards,
> Honza
>
> Patrick Hemmer wrote:
>> *From: *Steven Dake <[email protected]>
>> *Sent: * 2013-09-30 18:12:25 E
>> *To: *Patrick Hemmer <[email protected]>
>> *CC: *[email protected]
>> *Subject: *Re: [corosync] Issue starting the CMAP service
>>
>>> On 09/30/2013 02:43 PM, Patrick Hemmer wrote:
>>>> *From: *Steven Dake <[email protected]>
>>>> *Sent: * 2013-09-30 16:50:26 E
>>>> *To: *Patrick Hemmer <[email protected]>
>>>> *CC: *[email protected]
>>>> *Subject: *Re: [corosync] Issue starting the CMAP service
>>>>
>>>>> On 09/30/2013 01:45 PM, Patrick Hemmer wrote:
>>>>>> I'm running corosync 2.3.2 on ubuntu precise. I'm playing with a 3
>>>>>> node cluster, and whenever I try to start corosync on one of the
>>>>>> nodes, it fails to start properly.
>>>>>> I just do a simple start with `corosync -f`, and whenever I try to
>>>>>> use any of the tools, they error:
>>>>>>
>>>>>> # corosync-cmapctl
>>>>>> Failed to initialize the cmap API. Error CS_ERR_TRY_AGAIN
>>>>>> # corosync-quorumtool
>>>>>> Cannot initialize CMAP service
>>>>>>
>>>>>> If I wait long enough (about 9 minutes or 530 seconds), it does end
>>>>>> up starting, and the tools work, but corosync-quorumtool shows the
>>>>>> only member is itself.
>>>>>>
>>>>>> However if I start corosync with `strace -f corosync -f` the tools
>>>>>> work fine immediately upon start (though it still doesn't show the
>>>>>> other nodes). Smells like race condition, but dunno where to begin.
>>>>>>
>>>>>>
>>>>> My guess is something is wrong with your network relating to
>>>>> multicast. Try using udpu mode - it is very stable now and removes
>>>>> multicast from the list of things that can go wrong.
>>>>>
>>>> I am using udpu, see the config :-)
>>>>
>>>>
>>> I assume you have the same config on all nodes? If so, try using ip
>>> addresses for the ring id. possibly a DNS resolution problem?
>>>
>>> Other than that, I'm stumped
>> Yes, exact same config on all nodes. All hosts are present in
>> /etc/hosts. Also when I do a tcpdump on the other nodes, I see traffic
>> on port 5405 coming from the node in question.
>>
>>> Regards
>>> -steve
>>>
>>>>> Regards
>>>>> -steve
>>>>>
>>>>>> This is the output from `corosync -f` (this node is 10.20.0.212):
>>>>>> notice [TOTEM ] Initializing transport (UDP/IP Unicast).
>>>>>> notice [TOTEM ] Initializing transmit/receive security (NSS)
>>>>>> crypto: none hash: none
>>>>>> notice [TOTEM ] The network interface [10.20.0.212] is now up.
>>>>>> notice [TOTEM ] adding new UDPU member {10.20.0.127}
>>>>>> notice [TOTEM ] adding new UDPU member {10.20.0.212}
>>>>>> notice [TOTEM ] adding new UDPU member {10.20.2.124}
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122820) was formed.
>>>>>> Members joined: 2
>>>>>> notice [TOTEM ] A new membership (10.20.0.127:1122824) was formed.
>>>>>> Members joined: 1 3
>>>>>> ### here is where it pauses for almost 9 minutes ###
>>>>>> error [TOTEM ] FAILED TO RECEIVE
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122876) was formed.
>>>>>> Members left: 1 3
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122936) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123008) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123064) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123124) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123180) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123248) was formed.
>>>>>> Members
>>>>>> notice [TOTEM ] A new membership (10.20.0.127:1123256) was formed.
>>>>>> Members joined: 1 3
>>>>>>
>>>>>>
>>>>>> This is the config (created by `pcs` utility), it's exactly the
>>>>>> same on all 3 nodes, and the other 2 nodes work fine:
>>>>>> ----
>>>>>> totem {
>>>>>>     version: 2
>>>>>>     secauth: off
>>>>>>     cluster_name: hapi-server
>>>>>>     transport: udpu
>>>>>> }
>>>>>>
>>>>>> nodelist {
>>>>>>     node {
>>>>>>         ring0_addr: i-74eb9c2f
>>>>>>         nodeid: 1
>>>>>>     }
>>>>>>     node {
>>>>>>         ring0_addr: i-a3bf0df9
>>>>>>         nodeid: 2
>>>>>>     }
>>>>>>     node {
>>>>>>         ring0_addr: i-ebcfcbb0
>>>>>>         nodeid: 3
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> quorum {
>>>>>>     provider: corosync_votequorum
>>>>>> }
>>>>>>
>>>>>> logging {
>>>>>>     to_syslog: yes
>>>>>> }
>>>>>> ----
>>>>>>
>>>>>>
>>>>>> -Patrick
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> discuss mailing list
>>>>>> [email protected]
>>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>
>>
>> Here's some additional info from the command line utils after waiting 9
>> minutes for it to come up:
>>
>> # corosync-quorumtool
>> Quorum information
>> ------------------
>> Date:             Mon Sep 30 22:16:24 2013
>> Quorum provider:  corosync_votequorum
>> Nodes:            1
>> Node ID:          2
>> Ring ID:          1124320
>> Quorate:          No
>>
>> Votequorum information
>> ----------------------
>> Expected votes:   3
>> Highest expected: 3
>> Total votes:      1
>> Quorum:           2 Activity blocked
>> Flags:
>>
>> Membership information
>> ----------------------
>> Nodeid Votes Name
>>      2     1 i-a3bf0df9 (local)
>>
>>
>> # corosync-cmapctl |grep member
>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>> runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
>> runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
>> runtime.totem.pg.mrp.srp.members.3.status (str) = joined
>>
>>
>> -Patrick
>>
>>
>> _______________________________________________
>> discuss mailing list
>> [email protected]
>> http://lists.corosync.org/mailman/listinfo/discuss
>>
