03.06.2010 22:42, Steven Dake wrote:
> The failed to receive logic in totem is not correct. This condition
> occurs when a node can't receive multicast packets for a long period of
> time. Generally it impacts low numbers of users which have hardware
> that exhibit out-of-norm behaviours.
>
> The solution is to more closely match the spec when forming a new gather
> list after a FAILED TO RECV is detected. Once this occurs, a singleton
> ring is formed. Then the FAILED TO RECV node is free to try to form a
> ring again if it can with the existing nodes.

I'm not sure whether this is related, but I caught a (silent) corosync exit
right after a FAILED TO RECEIVE message. It happened on the node that was
already running, just after the second node came up. This is still a test
installation, so there is no stonith.

Here is a syslog snippet (sorry for line breaks):
-------------
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 1020: memb=1, new=0, lost=0
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: memb: s01-1 49939722
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] CLM CONFIGURATION CHANGE
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] New Configuration:
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Left:
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] Members Joined:
Jul 19 10:15:46 s01-1 corosync[1605]: [CLM ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 1020: memb=2, new=1, lost=0
Jul 19 10:15:46 s01-1 cib: [1613]: notice: ais_dispatch: Membership 1020: quorum acquired
Jul 19 10:15:46 s01-1 crmd: [1617]: notice: ais_dispatch: Membership 1020: quorum acquired
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: update_member: Node 33162506/s01-0 is now: member
Jul 19 10:15:46 s01-1 cib: [1613]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
Jul 19 10:15:46 s01-1 crmd: [1617]: info: ais_status_callback: status: s01-0 is now member (was lost)
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: NEW: s01-0 33162506
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-0 33162506
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: pcmk_peer_update: MEMB: s01-1 49939722
Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249) votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
Jul 19 10:15:46 s01-1 corosync[1605]: [pcmk ] info: send_member_notification: Sending membership update 1020 to 3 children
Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_quorum: Updating quorum status to true (call=365)
Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/lrm (origin=local/crmd/361, version=0.2232.5): ok (rc=0)
Jul 19 10:15:46 s01-1 corosync[1605]: [TOTEM ] FAILED TO RECEIVE
Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/transient_attributes (origin=local/crmd/362, version=0.2232.6): ok (rc=0)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: AIS connection failed
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: AIS connection failed
Jul 19 10:15:46 s01-1 attrd: [1615]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Jul 19 10:15:46 s01-1 attrd: [1615]: info: main: Exiting...
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...

And so on for the other pacemaker processes.
----------------

There are no more corosync-originated messages after that.

The system is Fedora 13 x86_64, corosync 1.2.6, openais 1.0.3 (for OCFS2).
The systems are connected by one back-to-back 10G cable (eth1) and
additionally via a VLAN over a bond formed from 4 pairs of 1G Intel adapters
(via switches).

Here is the corosync config:
---------------
compatibility: none

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
#   consensus: 1500
#   vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
#   secauth: on
    threads: 0
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 10.5.250.0
        mcastaddr: 239.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.5.4.0
        mcastaddr: 239.94.2.1
        mcastport: 5405
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    logfile: /tmp/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

service {
    name: pacemaker
    ver: 0
}

aisexec {
    user: root
    group: root
}
----------------

I can reconfigure corosync to produce more debug output if needed and try to
catch the error again. What additional information would be helpful to
understand what's going on?

Thanks,
Vladislav
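
P.S. For reference, a rough sketch of the logging change I have in mind for
the next attempt. It assumes only the logging section is touched and reuses
the options and logfile path already present in the config above; whether
this level of output is enough to show the cause is a guess on my part.
---------------
logging {
    fileline: off
    to_stderr: no
    # assumption: also write to the logfile so debug output is kept apart from syslog
    to_logfile: yes
    to_syslog: yes
    logfile: /tmp/corosync.log
    # assumption: the global debug switch is sufficient; no per-subsystem tuning
    debug: on
    timestamp: on
}
---------------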