Just caught it again, and ran fplay. Here are the last lines before fplay goes into an infinite loop:
---------------
rec=[11838] Log Message=mcasted message added to pending queue
rec=[11839] Log Message=Delivering bf to c4
rec=[11840] Log Message=Received ringid(10.5.250.1:1148) seq c4
rec=[11841] Log Message=Delivering bf to c4
rec=[11842] Log Message=Delivering bf to c4
rec=[11843] Log Message=Delivering bf to c4
rec=[11844] Log Message=Delivering bf to c4
rec=[11845] Log Message=Delivering bf to c4
rec=[11846] Log Message=Delivering bf to c4
rec=[11847] Log Message=Delivering bf to c4
rec=[11848] Log Message=Delivering bf to c4
rec=[11849] Log Message=Delivering bf to c4
rec=[11850] Log Message=Delivering bf to c4
rec=[11851] Log Message=Delivering bf to c4
rec=[11852] Log Message=Delivering bf to c4
rec=[11853] Log Message=Delivering bf to c4
rec=[11854] Log Message=Delivering bf to c4
rec=[11855] Log Message=Delivering bf to c4
rec=[11856] Log Message=Delivering bf to c4
rec=[11857] Log Message=Delivering bf to c4
rec=[11858] Log Message=Delivering bf to c4
rec=[11859] Log Message=Delivering bf to c4
rec=[11860] Log Message=Delivering bf to c4
rec=[11861] Log Message=FAILED TO RECEIVE
rec=[11862] Log Message=entering GATHER state from 6.
rec=[11863] Log Message=got commit token
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
rec=[11864] Log Message=Saving state aru bf high seq received c4
----------------
I can send the fplay data as well if it would help to identify the problem. BTW, I don't know where that infinite loop in fplay originates, but I suspect it could be caused by data which is possibly broken...

22.07.2010 14:48, Vladislav Bogdanov wrote:
> 22.07.2010 08:50, Steven Dake wrote:
>> On Wed, Jul 21, 2010 at 11:21 PM, Steven Dake <steven.d...@gmail.com> wrote:
>>
>> The bug you responded to was fixed in flatiron branch 2936.
>> (corosync 1.2.4 or later)
>>
>> Which bonding mode are you using? Are the switches connected via
>> inter-switch links? ISL is a requirement for correct bonding
>> operation when using IGMP multicast. IGMP multicast is used by
>> corosync.
>
> I use 802.3ad (mode 4 of the bonding driver), fast LACP, with a switch
> stack (2 of 4 gig ports of each cluster node go to each switch in the
> stack). I also have dedicated 10gig ports on each node. I decided to use
> a hack described by Vadim Chepkov somewhere on the OCFS2 list - I use
> loopback interfaces for DRBD with redundant OSPF-controlled routes. I
> also installed one high-metric route on each node to the peer's loopback
> interface. I use the -k -r flags to zebra, so routes are not deleted
> from the FIB on ospfd or zebra exit. So I have redundant paths for DRBD
> synchronization. To separate DRBD and OSPF traffic from the rest of the
> servers I set up a VLAN over the bonded interface. The bond link is the
> outer one, so that makes sense to me. So I have one "primary" 10G link
> for DRBD with fallback to VLAN-over-bonding in case of 10G interface or
> link problems.
>
> Then, I decided to use both the 10G interface and the VLAN-over-bonding
> for redundant rings in corosync.
>
> OSPF uses multicast too, and it works perfectly in my setup, so I do not
> expect any problems related to multicast forwarding.
> ----------
> s01-0# show ip ospf neighbor
>
> Neighbor ID  Pri  State        Dead Time  Address     Interface            RXmtL RqstL DBsmL
> 10.5.249.2     1  Full/Backup  1.512s     10.5.250.6  bond0.99:10.5.250.5      0     0     0
> 10.5.249.2     1  Full/Backup  1.512s     10.5.250.2  eth1:10.5.250.1          0     0     0
> ----------
>
>> At this time, we have only tested bonding mode 1 with good success
>> (with an ISL).
>>
>> We have found bonding mode 0 in older kernels (such as those used in
>> RHEL5 and CentOS) is defective because of a kernel bug:
>> https://bugzilla.redhat.com/show_bug.cgi?id=570645
>>
>> More details about kernel version would be helpful. Did you
>> "unplug" one of the cables to test the bonding failover?
>
> This is a Fedora 13 kernel, 2.6.33.6-147.fc13.x86_64.
> I decided to move away from CentOS 5 because of its lack of DLM
> functionality. 802.3ad works perfectly, and there were no plug/unplug
> events at the time of the error.
>
>> With bonding, it is not recommended to use the redundant ring
>> feature. There should only be one interface directive in your
>> configuration file, and its value should be the bond interface. I
>> am not sure what would happen with bonding + redundant ring, but
>> that has never been verified.
>
> That is what I probably need to rethink. Would you please provide some
> details on why use of bonding could affect things?
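[For anyone following along, here is a rough sketch of what Steven's single-ring suggestion might look like: exactly one interface directive, bound to the bonded network. This is illustrative only - the subnet and multicast address are simply taken from the config quoted later in this thread and would need to match the actual bond addressing.]

---------------
totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        interface {
                # single ring bound to the bond's subnet (assumed 10.5.250.0)
                ringnumber: 0
                bindnetaddr: 10.5.250.0
                mcastaddr: 239.94.1.1
                mcastport: 5405
        }
}
---------------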
>
> BTW, here is the bonding status:
> -------------
> cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 160
>
> 802.3ad info
> LACP rate: fast
> Aggregator selection policy (ad_select): bandwidth
> Active Aggregator Info:
>         Aggregator ID: 1
>         Number of ports: 4
>         Actor Key: 17
>         Partner Key: 8
>         Partner Mac Address: 00:30:48:e0:7b:2c
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:30:48:c8:21:c4
> Aggregator ID: 1
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:30:48:c8:21:c5
> Aggregator ID: 1
>
> Slave Interface: eth4
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1b:21:4f:2a:7c
> Aggregator ID: 1
>
> Slave Interface: eth5
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1b:21:4f:2a:7d
> Aggregator ID: 1
> -------------
>
>> Regards
>> -steve
>>
>> On Wed, Jul 21, 2010 at 10:44 PM, Vladislav Bogdanov
>> <bub...@hoster-ok.com> wrote:
>>
>> 03.06.2010 22:42, Steven Dake wrote:
>> > The failed to receive logic in totem is not correct. This condition
>> > occurs when a node can't receive multicast packets for a long
>> > period of time. Generally it impacts low numbers of users which
>> > have hardware that exhibits out-of-norm behaviours.
>> >
>> > The solution is to more closely match the spec when forming a new
>> > gather list after a FAILED TO RECV is detected. Once this occurs, a
>> > singleton ring is formed. Then the FAILED TO RECV node is free to
>> > try to form a ring again if it can with the existing nodes.
>>
>> I'm not sure this is connected to this, but I caught a (silent)
>> corosync exit after a FAILED TO RECEIVE message.
>> It was on the alive node, just after the second node came up. This is
>> still a testing installation, so no stonith.
>>
>> Here is a syslog snippet (sorry for line breaks):
>>
>> -------------
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 1020: memb=1, new=0, lost=0
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: memb: s01-1 49939722
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
>> Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 1020: memb=2, new=1, lost=0
>> Jul 19 10:15:46 s01-1 cib: [1613]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 crmd: [1617]: notice: ais_dispatch: Membership 1020: quorum acquired
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: update_member: Node 33162506/s01-0 is now: member
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: ais_status_callback: status: s01-0 is now member (was lost)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: NEW: s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-0 33162506
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-1 49939722
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
>> Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: send_member_notification: Sending membership update 1020 to 3 children
>> Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_quorum: Updating quorum status to true (call=365)
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/lrm (origin=local/crmd/361, version=0.2232.5): ok (rc=0)
>> Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] FAILED TO RECEIVE
>> Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/transient_attributes (origin=local/crmd/362, version=0.2232.6): ok (rc=0)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: AIS connection failed
>> Jul 19 10:15:46 s01-1 attrd: [1615]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
>> Jul 19 10:15:46 s01-1 attrd: [1615]: info: main: Exiting...
>> Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
>> And so on for the other pacemaker processes.
>> ----------------
>>
>> No more corosync-originated messages.
>>
>> The system is Fedora 13 x86_64, corosync 1.2.6, openais 1.0.3 (for
>> OCFS2). Systems are connected with one 10G back-to-back cable (eth1)
>> and additionally via a VLAN over bonding formed by 4 pairs of 1G
>> Intel adapters (via switches).
>>
>> Here is the corosync config:
>> ---------------
>> compatibility: none
>>
>> totem {
>>         version: 2
>>         token: 3000
>>         token_retransmits_before_loss_const: 10
>>         join: 60
>> #       consensus: 1500
>> #       vsftype: none
>>         max_messages: 20
>>         clear_node_high_bit: yes
>> #       secauth: on
>>         threads: 0
>>         rrp_mode: passive
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 10.5.250.0
>>                 mcastaddr: 239.94.1.1
>>                 mcastport: 5405
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 10.5.4.0
>>                 mcastaddr: 239.94.2.1
>>                 mcastport: 5405
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_stderr: no
>>         to_logfile: no
>>         to_syslog: yes
>>         logfile: /tmp/corosync.log
>>         debug: off
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: off
>>         }
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> service {
>>         name: pacemaker
>>         ver: 0
>> }
>>
>> aisexec {
>>         user: root
>>         group: root
>> }
>> ----------------
>>
>> I would reconfigure corosync to provide more debug output if needed
>> and try to re-catch that error.
>>
>> What additional information would be helpful to understand
>> what's going on?
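[On the "more debug output" point: one plausible way to do that, assuming the standard corosync 1.2.x logging options, would be to enable file logging and turn on debug only for the totem subsystem, roughly like this:]

---------------
logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /tmp/corosync.log
        timestamp: on
        # keep global debug off; enable it only for the totem subsystem
        debug: off
        logger_subsys {
                subsys: TOTEM
                debug: on
        }
}
---------------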
>> Thanks,
>> Vladislav

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais