Just caught it again, and ran fplay. Here are the last lines before fplay goes into an infinite loop:
---------------
rec=[11838] Log Message=mcasted message added to pending queue
rec=[11839] Log Message=Delivering bf to c4
rec=[11840] Log Message=Received ringid(10.5.250.1:1148) seq c4
rec=[11841] Log Message=Delivering bf to c4
rec=[11842] Log Message=Delivering bf to c4
rec=[11843] Log Message=Delivering bf to c4
rec=[11844] Log Message=Delivering bf to c4
rec=[11845] Log Message=Delivering bf to c4
rec=[11846] Log Message=Delivering bf to c4
rec=[11847] Log Message=Delivering bf to c4
rec=[11848] Log Message=Delivering bf to c4
rec=[11849] Log Message=Delivering bf to c4
rec=[11850] Log Message=Delivering bf to c4
rec=[11851] Log Message=Delivering bf to c4
rec=[11852] Log Message=Delivering bf to c4
rec=[11853] Log Message=Delivering bf to c4
rec=[11854] Log Message=Delivering bf to c4
rec=[11855] Log Message=Delivering bf to c4
rec=[11856] Log Message=Delivering bf to c4
rec=[11857] Log Message=Delivering bf to c4
rec=[11858] Log Message=Delivering bf to c4
rec=[11859] Log Message=Delivering bf to c4
rec=[11860] Log Message=Delivering bf to c4
rec=[11861] Log Message=FAILED TO RECEIVE
rec=[11862] Log Message=entering GATHER state from 6.
rec=[11863] Log Message=got commit token
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
----------------
I can send the fplay data as well if it would help to identify the problem. BTW, I don't know where that infinite loop in fplay originates, but I suspect it could be caused by data which is possibly broken...

22.07.2010 14:48, Vladislav Bogdanov wrote:
> 22.07.2010 08:50, Steven Dake wrote:
>> On Wed, Jul 21, 2010 at 11:21 PM, Steven Dake <steven.d...@gmail.com> wrote:
>>
>> The bug you responded to was fixed in flatiron branch 2936.
>> (corosync 1.2.4 or later)
>>
>> Which bonding mode are you using? Are the switches connected via
>> inter-switch links? ISL is a requirement for correct bonding
>> operation when using IGMP multicast. IGMP multicast is used by
>> corosync.
>
> I use 802.3ad (mode 4 of the bonding driver), fast LACP, with a switch
> stack (2 of 4 gig ports of each cluster node go to each switch in the
> stack). I also have dedicated 10gig ports on each node. I decided to use
> a hack described by Vadim Chepkov somewhere on the OCFS2 list - I use
> loopback interfaces for DRBD with redundant OSPF-controlled routes. I
> also installed one high-metric route on each node to the peer's loopback
> interface. I use the -k -r flags to zebra, so routes are not deleted
> from the FIB on ospfd or zebra exit. So I have redundant paths for DRBD
> synchronization. To separate DRBD and OSPF traffic from the rest of the
> servers I set up a VLAN over the bonded interface. The bond link is the
> outer one, so that makes sense to me. So I have one "primary" 10G link
> for DRBD with fallback to VLAN-over-bonding in case of 10G interface or
> link problems.
>
> Then, I decided to use both the 10G interface and the VLAN-over-bonding
> for redundant rings in corosync.
>
> OSPF uses multicast too, and it works perfectly in my setup, so I do not
> expect any problems related to multicast forwarding.
> ----------
> s01-0# show ip ospf neighbor
>
> Neighbor ID  Pri  State        Dead Time  Address     Interface            RXmtL RqstL DBsmL
> 10.5.249.2     1  Full/Backup  1.512s     10.5.250.6  bond0.99:10.5.250.5      0     0     0
> 10.5.249.2     1  Full/Backup  1.512s     10.5.250.2  eth1:10.5.250.1          0     0     0
> ----------
>
>> At this time, we have only tested bonding mode 1 with good success
>> (with an ISL).
>>
>> We have found bonding mode 0 in older kernels (such as those used in
>> RHEL5 and CentOS) is defective because of a kernel bug:
>> https://bugzilla.redhat.com/show_bug.cgi?id=570645
>>
>> More details about kernel version would be helpful. Did you
>> "unplug" one of the cables to test the bonding failover?
>
> This is a Fedora 13 kernel, 2.6.33.6-147.fc13.x86_64.
> I decided to move away from CentOS 5 because of its lack of DLM
> functionality. 802.3ad works perfectly, and there were no plug/unplug
> events at the time of the error.
>
>> With bonding, it is not recommended to use the redundant ring
>> feature. There should only be one interface directive in your
>> configuration file, and its value should be the bond interface. I
>> am not sure what would happen with bonding + redundant ring, but
>> that has never been verified.
>
> That is what I probably need to rethink. Would you please provide some
> details on why use of bonding could affect things?
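[For anyone following along, here is a rough sketch of what Steven's single-ring suggestion might look like: exactly one interface directive, bound to the bonded network. This is illustrative only - the subnet and multicast address are simply taken from the config quoted later in this thread and would need to match the actual bond addressing.]

---------------
totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        interface {
                # single ring bound to the bond's subnet (assumed 10.5.250.0)
                ringnumber: 0
                bindnetaddr: 10.5.250.0
                mcastaddr: 239.94.1.1
                mcastport: 5405
        }
}
---------------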
>
> BTW, here is the bonding status:
> -------------
> cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 160
>
> 802.3ad info
> LACP rate: fast
> Aggregator selection policy (ad_select): bandwidth
> Active Aggregator Info:
>         Aggregator ID: 1
>         Number of ports: 4
>         Actor Key: 17
>         Partner Key: 8
>         Partner Mac Address: 00:30:48:e0:7b:2c
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:30:48:c8:21:c4
> Aggregator ID: 1
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:30:48:c8:21:c5
> Aggregator ID: 1
>
> Slave Interface: eth4
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1b:21:4f:2a:7c
> Aggregator ID: 1
>
> Slave Interface: eth5
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1b:21:4f:2a:7d
> Aggregator ID: 1
> -------------
>
>> Regards
>> -steve
>>
>> On Wed, Jul 21, 2010 at 10:44 PM, Vladislav Bogdanov
>> <bub...@hoster-ok.com> wrote:
>>
>> 03.06.2010 22:42, Steven Dake wrote:
>> > The failed to receive logic in totem is not correct. This condition
>> > occurs when a node can't receive multicast packets for a long
>> > period of time. Generally it impacts low numbers of users which
>> > have hardware that exhibits out-of-norm behaviours.
>> >
>> > The solution is to more closely match the spec when forming a new
>> > gather list after a FAILED TO RECV is detected. Once this occurs, a
>> > singleton ring is formed. Then the FAILED TO RECV node is free to
>> > try to form a ring again if it can with the existing nodes.
>>
>> I'm not sure this is connected to this, but I caught a (silent)
>> corosync exit after a FAILED TO RECEIVE message.
>> It was on the alive node, just after the second node came up. This is
>> still a testing installation, so no stonith.
>>
>> Here is a syslog snippet (sorry for line breaks):
>>
>> -------------
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 1020: memb=1, new=0, lost=0
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: memb: s01-1 49939722
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 1020: memb=2, new=1, lost=0
>> Jul 19 10:15:46 s01-1 cib: [1613]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 crmd: [1617]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: update_member: Node 33162506/s01-0 is now: member
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: ais_status_callback: status: s01-0 is now member (was lost)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: NEW: s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-1 49939722
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: send_member_notification: Sending membership update 1020 to 3 children
>> Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_quorum: Updating quorum status to true (call=365)
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/lrm (origin=local/crmd/361, version=0.2232.5): ok (rc=0)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] FAILED TO RECEIVE
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/transient_attributes (origin=local/crmd/362, version=0.2232.6): ok (rc=0)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
>> Jul 19 10:15:46 s01-1 attrd: [1615]: info: main: Exiting...
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
>> And so on for the other pacemaker processes.
>> ----------------
>>
>> No more corosync-originated messages.
>>
>> The system is Fedora 13 x86_64, corosync 1.2.6, openais 1.0.3 (for
>> OCFS2). Systems are connected with one 10G back-to-back cable (eth1)
>> and additionally via a VLAN over bonding formed by 4 pairs of 1G
>> Intel adapters (via switches).
>>
>> Here is the corosync config:
>> ---------------
>> compatibility: none
>>
>> totem {
>>         version: 2
>>         token: 3000
>>         token_retransmits_before_loss_const: 10
>>         join: 60
>> #       consensus: 1500
>> #       vsftype: none
>>         max_messages: 20
>>         clear_node_high_bit: yes
>> #       secauth: on
>>         threads: 0
>>         rrp_mode: passive
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 10.5.250.0
>>                 mcastaddr: 239.94.1.1
>>                 mcastport: 5405
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 10.5.4.0
>>                 mcastaddr: 239.94.2.1
>>                 mcastport: 5405
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_stderr: no
>>         to_logfile: no
>>         to_syslog: yes
>>         logfile: /tmp/corosync.log
>>         debug: off
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: off
>>         }
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> service {
>>         name: pacemaker
>>         ver: 0
>> }
>>
>> aisexec {
>>         user: root
>>         group: root
>> }
>> ----------------
>>
>> I would reconfigure corosync to provide more debug output if needed
>> and try to re-catch that error.
>>
>> What additional information would be helpful to understand
>> what's going on?
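[On the "more debug output" point: one plausible way to do that, assuming the standard corosync 1.2.x logging options, would be to enable file logging and turn on debug only for the totem subsystem, roughly like this:]

---------------
logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /tmp/corosync.log
        timestamp: on
        # keep global debug off; enable it only for the totem subsystem
        debug: off
        logger_subsys {
                subsys: TOTEM
                debug: on
        }
}
---------------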
>> Thanks,
>> Vladislav

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais