I meant to send my response to the list; my apologies for the double copy.

On Wed, Jul 21, 2010 at 11:21 PM, Steven Dake <steven.d...@gmail.com> wrote:

> The bug you responded to was fixed in the flatiron branch at revision 2936
> (corosync 1.2.4 or later).
>
> Which bonding mode are you using?  Are the switches connected via
> inter-switch links?  An ISL is a requirement for correct bonding operation
> when using IP multicast, and corosync relies on IP multicast (with IGMP for
> group management) for its communication.
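>
> As a quick sanity check (a sketch only; bond0 here is a placeholder for
> whatever interface corosync actually binds to), you can verify that the
> node has really joined its multicast group:
>
> ---------------
> # List the multicast groups joined on the interface; the mcastaddr from
> # corosync.conf should appear here while corosync is running.
> ip maddr show dev bond0
>
> # Kernel-wide view of IGMP group membership, per interface.
> cat /proc/net/igmp
> ---------------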
>
> At this time, we have only tested bonding mode 1 (active-backup), with good
> success (with an ISL).
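>
> For reference, a minimal mode 1 setup on a RHEL/CentOS-style system looks
> something like the following (a sketch; interface names, addresses, and the
> miimon value are examples, and on Fedora you would typically put
> BONDING_OPTS="mode=1 miimon=100" in ifcfg-bond0 instead of modprobe options):
>
> ---------------
> # /etc/modprobe.conf (or a file under /etc/modprobe.d/ on newer systems)
> alias bond0 bonding
> options bonding mode=1 miimon=100
>
> # /etc/sysconfig/network-scripts/ifcfg-bond0
> DEVICE=bond0
> IPADDR=10.5.250.2
> NETMASK=255.255.255.0
> ONBOOT=yes
> BOOTPROTO=none
>
> # /etc/sysconfig/network-scripts/ifcfg-eth0 (one such file per slave)
> DEVICE=eth0
> MASTER=bond0
> SLAVE=yes
> ONBOOT=yes
> BOOTPROTO=none
> ---------------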
>
> We have found that bonding mode 0 (balance-rr) is defective in older kernels
> (such as those shipped with RHEL 5 and CentOS 5) because of a kernel bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=570645
>
> More details about your kernel version would be helpful.  Did you "unplug"
> one of the cables to test the bonding failover?
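>
> If not, a simple way to exercise the failover path (again a sketch; bond0
> and eth0 stand in for your bond and its currently active slave, and
> physically pulling the cable is the more realistic test):
>
> ---------------
> # Check which slave is currently active and the link state of each slave.
> cat /proc/net/bonding/bond0
>
> # Take the active slave down and watch the bond fail over.
> ip link set eth0 down
> cat /proc/net/bonding/bond0
>
> # Restore the link afterwards.
> ip link set eth0 up
> ---------------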
>
> With bonding, using the redundant ring feature is not recommended.  There
> should be only one interface directive in your configuration file, and it
> should bind to the bond interface's network.  I am not sure what would
> happen with bonding + redundant ring; that combination has never been
> verified.
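>
> In other words, the totem section would carry a single ring, roughly like
> this (a sketch based on your posted config; I am assuming the 10.5.4.0
> network from your ring 1 is the one carried over the bond):
>
> ---------------
> totem {
>   version: 2
>   # a single ring, so no rrp_mode is needed
>   interface {
>     ringnumber: 0
>     bindnetaddr: 10.5.4.0
>     mcastaddr: 239.94.2.1
>     mcastport: 5405
>   }
> }
> ---------------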
>
> Regards
> -steve
>
>
> On Wed, Jul 21, 2010 at 10:44 PM, Vladislav Bogdanov <bub...@hoster-ok.com
> > wrote:
>
>> 03.06.2010 22:42, Steven Dake wrote:
>> > The failed-to-receive logic in totem is not correct.  This condition
>> > occurs when a node cannot receive multicast packets for a long period of
>> > time.  Generally it affects the small number of users whose hardware
>> > exhibits out-of-norm behaviour.
>> >
>> > The solution is to match the spec more closely when forming a new gather
>> > list after a FAILED TO RECV is detected.  Once this occurs, a singleton
>> > ring is formed.  The FAILED TO RECV node is then free to try to form a
>> > ring again with the existing nodes, if it can.
>>
>> I'm not sure whether this is related, but I caught a (silent) corosync
>> exit after a FAILED TO RECEIVE message.  It happened on the surviving node
>> just after the second node came up.  This is still a testing installation,
>> so there is no stonith yet.
>>
>> Here is a syslog snippet:
>>
>> -------------
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 1020: memb=1, new=0, lost=0
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: memb: s01-1 49939722
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 1020: memb=2, new=1, lost=0
>> Jul 19 10:15:46 s01-1 cib: [1613]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 crmd: [1617]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: update_member: Node 33162506/s01-0 is now: member
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)  votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: ais_status_callback: status: s01-0 is now member (was lost)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: NEW:  s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: MEMB: s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: MEMB: s01-1 49939722
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)  votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: send_member_notification: Sending membership update 1020 to 3 children
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_quorum: Updating quorum status to true (call=365)
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='s01-0']/lrm (origin=local/crmd/361, version=0.2232.5): ok (rc=0)
>> Jul 19 10:15:46 s01-1 corosync[1605]:   [TOTEM ] FAILED TO RECEIVE
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='s01-0']/transient_attributes (origin=local/crmd/362, version=0.2232.6): ok (rc=0)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
>> Jul 19 10:15:46 s01-1 attrd: [1615]: info: main: Exiting...
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
>> ... and so on for the other Pacemaker processes.
>> ----------------
>>
>> After this point there are no more corosync-originated messages at all.
>>
>> The system is Fedora 13 x86_64 with corosync 1.2.6 and openais 1.0.3
>> (needed for OCFS2).  The nodes are connected with one 10G back-to-back
>> cable (eth1) and additionally via a VLAN on top of a bond formed by four
>> pairs of 1G Intel adapters (through switches).
>>
>> Here is corosync config:
>> ---------------
>> compatibility: none
>>
>> totem {
>>   version: 2
>>   token: 3000
>>   token_retransmits_before_loss_const: 10
>>   join: 60
>> #  consensus: 1500
>> #  vsftype: none
>>   max_messages: 20
>>   clear_node_high_bit: yes
>> #  secauth: on
>>   threads: 0
>>   rrp_mode: passive
>>   interface {
>>     ringnumber: 0
>>     bindnetaddr: 10.5.250.0
>>     mcastaddr: 239.94.1.1
>>     mcastport: 5405
>>   }
>>   interface {
>>     ringnumber: 1
>>     bindnetaddr: 10.5.4.0
>>     mcastaddr: 239.94.2.1
>>     mcastport: 5405
>>   }
>> }
>>
>> logging {
>>   fileline: off
>>   to_stderr: no
>>   to_logfile: no
>>   to_syslog: yes
>>   logfile: /tmp/corosync.log
>>   debug: off
>>   timestamp: on
>>   logger_subsys {
>>     subsys: AMF
>>     debug: off
>>   }
>> }
>>
>> amf {
>>   mode: disabled
>> }
>>
>> service {
>>   name: pacemaker
>>   ver:  0
>> }
>>
>> aisexec {
>>   user:  root
>>   group: root
>> }
>> ----------------
>>
>> I can reconfigure corosync to produce more debug output if that would
>> help, and then try to catch the error again.
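>>
>> Concretely, I would change the logging section roughly as follows (a
>> sketch; the logfile path is just the one already present in my config):
>>
>> ---------------
>> logging {
>>   fileline: on
>>   to_stderr: no
>>   to_logfile: yes
>>   to_syslog: yes
>>   logfile: /tmp/corosync.log
>>   debug: on
>>   timestamp: on
>> }
>> ---------------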
>>
>> What additional information would be helpful to understand what's going
>> on?
>>
>> Thanks,
>> Vladislav
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
