It wasn't.

I still haven't fully tracked the issue down, but it was caused by
another node in the cluster. Node B, which I had just started, was
trying to send traffic to node A, and node A was in a weird state.
Node B would not start successfully until corosync on node A was
restarted.

I've had this happen a few times now in the last few days. The ability
for one node to cause a start failure on another node is a significant
problem.
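
Until the root cause is found, a start script can at least wait for
CMAP instead of giving up on the first CS_ERR_TRY_AGAIN. A rough sketch
only: `wait_for` is a hypothetical helper, and it assumes that
corosync-cmapctl exits non-zero while the CMAP service is unavailable.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to N times, one second apart,
# until it succeeds. Returns 0 on success, 1 if every attempt failed.
wait_for() {
    attempts="$1"; shift
    tried=0
    until "$@" >/dev/null 2>&1; do
        tried=$((tried + 1))
        # Stop once the allowed number of attempts has been used up.
        [ "$tried" -ge "$attempts" ] && return 1
        sleep 1
    done
    return 0
}

# Usage (assumption: corosync-cmapctl fails while CMAP is not ready):
#   wait_for 600 corosync-cmapctl || echo "CMAP still unavailable" >&2
```

The 600 attempts in the usage line are only chosen to cover the roughly
530 seconds the node needed in the original report.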



-Patrick

------------------------------------------------------------------------
*From: *Jan Friesse <[email protected]>
*Sent: * 2013-10-10 03:58:31 E
*To: *Patrick Hemmer <[email protected]>, Steven Dake
<[email protected]>
*CC: *[email protected]
*Subject: *Re: [corosync] Issue starting the CMAP service

> Patrick,
> I'm sure it's really a firewall/switch problem. Please make sure that
> port and port - 1 are not blocked. For testing purposes, you can just
> disable the firewall completely and see if corosync works or not.
>
> Regards,
>   Honza
>
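
For anyone checking the firewall angle: "port and port - 1" with the
5405 seen in the tcpdump further down means UDP 5404 and 5405. A minimal
sketch, assuming iptables is the firewall in use; `corosync_ports` and
`print_rules` are hypothetical helpers, and the rules are only printed,
not applied.

```shell
#!/bin/sh
# Hypothetical helper: print the two UDP ports that must be open,
# given the configured port (5405 in this thread).
corosync_ports() {
    port="$1"
    echo "$((port - 1)) $port"
}

# Print (do not apply) iptables rules that would accept both ports.
# Assumes iptables; adjust for whatever firewall is actually in use.
print_rules() {
    for p in $(corosync_ports "$1"); do
        echo "iptables -A INPUT -p udp --dport $p -j ACCEPT"
    done
}

print_rules 5405
```
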
> Patrick Hemmer wrote:
>> *From: *Steven Dake <[email protected]>
>> *Sent: * 2013-09-30 18:12:25 E
>> *To: *Patrick Hemmer <[email protected]>
>> *CC: *[email protected]
>> *Subject: *Re: [corosync] Issue starting the CMAP service
>>
>>> On 09/30/2013 02:43 PM, Patrick Hemmer wrote:
>>>> *From: *Steven Dake <[email protected]>
>>>> *Sent: * 2013-09-30 16:50:26 E
>>>> *To: *Patrick Hemmer <[email protected]>
>>>> *CC: *[email protected]
>>>> *Subject: *Re: [corosync] Issue starting the CMAP service
>>>>
>>>>> On 09/30/2013 01:45 PM, Patrick Hemmer wrote:
>>>>>> I'm running corosync 2.3.2 on ubuntu precise. I'm playing with a 3
>>>>>> node cluster, and whenever I try to start corosync on one of the
>>>>>> nodes, it fails to start properly.
>>>>>> I just do a simple start with `corosync -f`, and whenever I try to 
>>>>>> use any of the tools, they error:
>>>>>>
>>>>>> # corosync-cmapctl
>>>>>> Failed to initialize the cmap API. Error CS_ERR_TRY_AGAIN
>>>>>> # corosync-quorumtool
>>>>>> Cannot initialize CMAP service
>>>>>>
>>>>>> If I wait long enough (about 9 minutes or 530 seconds), it does end
>>>>>> up starting, and the tools work, but corosync-quorumtool shows the
>>>>>> only member is itself.
>>>>>>
>>>>>> However if I start corosync with `strace -f corosync -f` the tools
>>>>>> work fine immediately upon start (though it still doesn't show the
>>>>>> other nodes). Smells like race condition, but dunno where to begin.
>>>>>>
>>>>>>
>>>>> My guess is something is wrong with your network relating to
>>>>> multicast.  Try using udpu mode - it is very stable now and removes
>>>>> multicast from the list of things that can go wrong.
>>>>>
>>>> I am using udpu, see the config :-)
>>>>
>>>>
>>> I assume you have the same config on all nodes?  If so, try using IP
>>> addresses for ring0_addr.  Possibly a DNS resolution problem?
>>>
>>> Other than that, I'm stumped.
>> Yes, exact same config on all nodes. All hosts are present in
>> /etc/hosts. Also when I do a tcpdump on the other nodes, I see traffic
>> on port 5405 coming from the node in question.
>>
>>> Regards
>>> -steve
>>>
>>>>> Regards
>>>>> -steve
>>>>>
>>>>>> This is the output from `corosync -f` (this node is 10.20.0.212):
>>>>>> notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
>>>>>> notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
>>>>>> notice  [TOTEM ] The network interface [10.20.0.212] is now up.
>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.0.127}
>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.0.212}
>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.2.124}
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122820) was formed. Members joined: 2
>>>>>> notice  [TOTEM ] A new membership (10.20.0.127:1122824) was formed. Members joined: 1 3
>>>>>> ### here is where it pauses for almost 9 minutes ###
>>>>>> error   [TOTEM ] FAILED TO RECEIVE
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122876) was formed. Members left: 1 3
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122936) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123008) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123064) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123124) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123180) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123248) was formed. Members
>>>>>> notice  [TOTEM ] A new membership (10.20.0.127:1123256) was formed. Members joined: 1 3
>>>>>>
>>>>>> This is the config (created by `pcs` utility), it's exactly the
>>>>>> same on all 3 nodes, and the other 2 nodes work fine:
>>>>>> ----
>>>>>> totem {
>>>>>> version: 2
>>>>>> secauth: off
>>>>>> cluster_name: hapi-server
>>>>>> transport: udpu
>>>>>> }
>>>>>>
>>>>>> nodelist {
>>>>>>   node {
>>>>>>         ring0_addr: i-74eb9c2f
>>>>>>         nodeid: 1
>>>>>>        }
>>>>>>   node {
>>>>>>         ring0_addr: i-a3bf0df9
>>>>>>         nodeid: 2
>>>>>>        }
>>>>>>   node {
>>>>>>         ring0_addr: i-ebcfcbb0
>>>>>>         nodeid: 3
>>>>>>        }
>>>>>> }
>>>>>>
>>>>>> quorum {
>>>>>> provider: corosync_votequorum
>>>>>> }
>>>>>>
>>>>>> logging {
>>>>>> to_syslog: yes
>>>>>> }
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Patrick
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> discuss mailing list
>>>>>> [email protected]
>>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>
>>
>> Here's some additional info from the command line utils after waiting 9
>> minutes for it to come up:
>>
>> # corosync-quorumtool
>> Quorum information
>> ------------------
>> Date:             Mon Sep 30 22:16:24 2013
>> Quorum provider:  corosync_votequorum
>> Nodes:            1
>> Node ID:          2
>> Ring ID:          1124320
>> Quorate:          No
>>
>> Votequorum information
>> ----------------------
>> Expected votes:   3
>> Highest expected: 3
>> Total votes:      1
>> Quorum:           2 Activity blocked
>> Flags:           
>>
>> Membership information
>> ----------------------
>>     Nodeid      Votes Name
>>          2          1 i-a3bf0df9 (local)
>>
>>
>> # corosync-cmapctl |grep member
>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>> runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
>> runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
>> runtime.totem.pg.mrp.srp.members.3.status (str) = joined
>>
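
The join_count values above stand out: members 1 and 3 show 15 joins
against 1 for the local node, which would be consistent with the
repeated membership changes in the log. A small sketch for pulling
those counters out of the corosync-cmapctl output; `join_counts` is a
hypothetical helper and it assumes the key layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read corosync-cmapctl output on stdin and print
# "nodeid join_count" for each member that has a join_count key.
join_counts() {
    sed -n 's/^runtime\.totem\.pg\.mrp\.srp\.members\.\([0-9][0-9]*\)\.join_count (u32) = \([0-9][0-9]*\)$/\1 \2/p'
}

# Usage on a live node:
#   corosync-cmapctl | join_counts
```
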
>>
>>
>> -Patrick
