03.06.2010 22:42, Steven Dake wrote:
> The failed to receive logic in totem is not correct.  This condition
> occurs when a node can't receive multicast packets for a long period of
> time.  Generally it impacts a small number of users whose hardware
> exhibits out-of-norm behaviour.
> 
> The solution is to match the spec more closely when forming a new gather
> list after a FAILED TO RECV is detected.  Once this occurs, a singleton
> ring is formed.  The FAILED TO RECV node is then free to try to re-form
> a ring with the existing nodes, if it can.
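
If I read the fix right, the corrected flow is roughly the following
(a pseudocode sketch in C with made-up names for illustration, not the
actual totemsrp symbols):

-------------
/* Rough sketch of the corrected FAILED TO RECV flow as described
 * above.  All names here are illustrative, not real totemsrp code. */

struct member_list {
	int count;
	/* ... node addresses ... */
};

struct totem_instance {
	struct member_list my_id;        /* this node only */
	struct member_list old_members;  /* last known ring members */
};

static void memb_state_gather_enter (struct totem_instance *instance,
	const struct member_list *candidates);

static void memb_failed_to_recv (struct totem_instance *instance)
{
	/* Step 1: form a singleton ring containing only this node,
	 * instead of staying wedged in the old ring. */
	memb_state_gather_enter (instance, &instance->my_id);

	/* Step 2: from the singleton ring, gather again with the
	 * previously known members, so the node can rejoin them if
	 * multicast reception recovers. */
	memb_state_gather_enter (instance, &instance->old_members);
}
-------------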

I'm not sure whether this is related, but I caught a (silent) corosync
exit right after a FAILED TO RECEIVE message. It happened on the
surviving node just after the second node came up. This is still a test
installation, so there is no stonith.

Here is a syslog snippet:

-------------
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] CLM CONFIGURATION CHANGE
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] New Configuration:
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Left:
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Joined:
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 1020: memb=1, new=0, lost=0
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: memb: s01-1 49939722
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] CLM CONFIGURATION CHANGE
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] New Configuration:
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.2) r(1) ip(10.5.4.251)
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Left:
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] Members Joined:
Jul 19 10:15:46 s01-1 corosync[1605]:   [CLM   ] #011r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 1020: memb=2, new=1, lost=0
Jul 19 10:15:46 s01-1 cib: [1613]: notice: ais_dispatch: Membership 1020: quorum acquired
Jul 19 10:15:46 s01-1 crmd: [1617]: notice: ais_dispatch: Membership 1020: quorum acquired
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: update_member: Node 33162506/s01-0 is now: member
Jul 19 10:15:46 s01-1 cib: [1613]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)  votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
Jul 19 10:15:46 s01-1 crmd: [1617]: info: ais_status_callback: status: s01-0 is now member (was lost)
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: NEW:  s01-0 33162506
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: MEMB: s01-0 33162506
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: pcmk_peer_update: MEMB: s01-1 49939722
Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_peer: Node s01-0: id=33162506 state=member (new) addr=r(0) ip(10.5.250.1) r(1) ip(10.5.4.249)  votes=1 born=880 seen=1020 proc=00000000000000000000000000111312
Jul 19 10:15:46 s01-1 corosync[1605]:   [pcmk  ] info: send_member_notification: Sending membership update 1020 to 3 children
Jul 19 10:15:46 s01-1 corosync[1605]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 19 10:15:46 s01-1 crmd: [1617]: info: crm_update_quorum: Updating quorum status to true (call=365)
Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/lrm (origin=local/crmd/361, version=0.2232.5): ok (rc=0)
Jul 19 10:15:46 s01-1 corosync[1605]:   [TOTEM ] FAILED TO RECEIVE
Jul 19 10:15:46 s01-1 cib: [1613]: info: cib_process_request: Operation complete: op cib_delete for section //node_sta...@uname='s01-0']/transient_attributes (origin=local/crmd/362, version=0.2232.6): ok (rc=0)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: ais_dispatch: AIS connection failed
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Jul 19 10:15:46 s01-1 stonith-ng: [1612]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: ais_dispatch: AIS connection failed
Jul 19 10:15:46 s01-1 attrd: [1615]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Jul 19 10:15:46 s01-1 attrd: [1615]: info: main: Exiting...
Jul 19 10:15:46 s01-1 attrd: [1615]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
... and so on for the other pacemaker processes
----------------

No more corosync-originated messages.

The system is Fedora 13 x86_64 with corosync 1.2.6 and openais 1.0.3
(needed for OCFS2). The nodes are connected with one 10G back-to-back
cable (eth1) and additionally via a VLAN over a bond of four pairs of
1G Intel adapters (through switches).

Here is the corosync config:
---------------
compatibility: none

totem {
  version: 2
  token: 3000
  token_retransmits_before_loss_const: 10
  join: 60
#  consensus: 1500
#  vsftype: none
  max_messages: 20
  clear_node_high_bit: yes
#  secauth: on
  threads: 0
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.5.250.0
    mcastaddr: 239.94.1.1
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.5.4.0
    mcastaddr: 239.94.2.1
    mcastport: 5405
  }
}
logging {
  fileline: off
  to_stderr: no
  to_logfile: no
  to_syslog: yes
  logfile: /tmp/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}

amf {
  mode: disabled
}

service {
  name: pacemaker
  ver:  0
}

aisexec {
  user:  root
  group: root
}
----------------

I can reconfigure corosync to produce more debug output if that would
help, and try to catch the error again.
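
Something like this in the logging section, I suppose (just a sketch of
what I have in mind: also write to a log file, and turn debug on for
the TOTEM subsystem only, to keep the output readable):

---------------
logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  logfile: /tmp/corosync.log
  debug: off
  timestamp: on
  # extra verbosity only for totem, where FAILED TO RECEIVE originates
  logger_subsys {
    subsys: TOTEM
    debug: on
  }
}
---------------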

What additional information would be helpful to understand what's going on?

Thanks,
Vladislav