On 16/04/2013, at 7:39 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
> Hi Andrew,
>
> Thanks for your reply, we tried that option but to no avail.
>
> To resolve the issue, what worked for us was to remove the existing HA packages
> and update pacemaker to 1.1.8-7.

Just moving to cman (without updating the packages) would/should have also worked. But upgrading was probably a good idea anyway :)

> Here is the procedure…
>
> 1. Back up /etc/corosync/corosync.conf and /etc/corosync/authkey.
> 2. Export cib.xml:
>        cibadmin -Q > /tmp/ha_backup/cib.xml
> 3. Stop the corosync service on all nodes.
> 4. Remove the existing HA packages:
>        yum -y remove pacemaker corosync heartbeat resource-agents \
>            cluster-glue rgmanager lvm2-cluster gfs2-utils
> 5. Install the updated HA packages:
>        yum -y install pacemaker cman ccs resource-agents
>
>    resulting in the following packages being installed:
>        pacemaker-doc-1.1.8-7.el6.x86_64
>        pacemaker-cli-1.1.8-7.el6.x86_64
>        pacemaker-libs-1.1.8-7.el6.x86_64
>        pacemaker-cts-1.1.8-7.el6.x86_64
>        pacemaker-libs-devel-1.1.8-7.el6.x86_64
>        pacemaker-cluster-libs-1.1.8-7.el6.x86_64
>        pacemaker-1.1.8-7.el6.x86_64
>        pacemaker-debuginfo-1.1.8-7.el6.x86_64
>        cman-3.0.12.1-49.el6.x86_64
>        ccs-0.16.2-55.el6.x86_64
>        resource-agents-3.9.2-12.el6.x86_64
>        cluster-glue-libs-1.0.5-6.el6.x86_64
>        corosync-1.4.1-15.el6.x86_64
>        corosynclib-1.4.1-15.el6.x86_64
>        corosync-debuginfo-1.4.1-15.el6.x86_64
>        corosynclib-devel-1.4.1-15.el6.x86_64
>
> 6. Get the crmsh package and install it:
>        yum -y install crmsh*
> 7. Start the ricci service:
>        service ricci start
>    Also ensure it starts on boot:
>        chkconfig --add ricci
> 8. Set the ricci password:
>        passwd ricci
> 9. Configure the cluster:
>        ccs -f /etc/cluster/cluster.conf --createcluster testprod -i
>        ccs -f /etc/cluster/cluster.conf --addnode node01
>        ccs -f /etc/cluster/cluster.conf --addnode node02
>        ccs -f /etc/cluster/cluster.conf --addnode node03
>        ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node01
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node02
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node03
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node01 pcmk-redirect port=1
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node02 pcmk-redirect port=2
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node03 pcmk-redirect port=3
>        ccs -f /etc/cluster/cluster.conf --setlogging debug=on
>        ccs -f /etc/cluster/cluster.conf --settotem
> 10. Distribute the cluster.conf:
>        ccs -h node01 -p ************** --sync --activate
> 11. Set the CMAN quorum timeout to 0 on all three nodes separately:
>        echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
> 12. Start the services on each node:
>        service cman start
>        service pacemaker start
>     Also ensure they start on boot:
>        chkconfig --add cman
>        chkconfig --add pacemaker
>
>
> Best of luck,
> Jimmy.
>
>
>
>
> On 12 Apr 2013, at 02:11, Andrew Beekhof <and...@beekhof.net> wrote:
>
>>
>> On 11/04/2013, at 6:05 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>
>>> Hi,
>>>
>>> Following up on the above thread, any thoughts as to what may be causing
>>> the issue..
>>
>> One of the main reasons pacemakerd was created was to avoid weirdness around
>> the starting of pacemaker's child processes from within a multi-threaded
>> application like corosync... which is almost certainly what you're bumping
>> into here.
>>
>> Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
>> to rule out any other causes?
>>
>>>
>>> Cheers,
>>> Jimmy.
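A note for the archives on the procedure above: step 2 saves the old CIB to /tmp/ha_backup/cib.xml, but no later step re-imports it. If the intent is to carry the old resource configuration across, a rough sketch of what that could look like once step 12 has completed on all nodes (hedged, not part of Jimmy's procedure; the path is just the one from step 2, and the exported CIB was written by 1.1.7 and still records cluster-infrastructure=openais, so review it before pushing it back):

    ccs_config_validate                                      # sanity-check the generated cluster.conf
    crm_mon -1                                               # confirm all three nodes show up under cman
    cibadmin --replace --xml-file /tmp/ha_backup/cib.xml     # re-import the saved configuration
    crm_verify -L                                            # validate the live CIB afterwards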
>>> >>> >>> >>> On 9 Apr 2013, at 13:39, Jimmy Magee <jimmy.ma...@vennetics.com> wrote: >>> >>>> Hi Andrew, >>>> >>>> The corosync.conf is configured as follows: >>>> >>>> >>>>> service { >>>>> # Load the Pacemaker Cluster Resource Manager >>>>> name: pacemaker >>>>> ver: 0 >>>>> } >>>> >>>> >>>> >>>> and pacemaker is not started via service pacemaker start… >>>> >>>> here is the extract from the logs with extra debug when attempting to >>>> start corosync/pacemaker.. >>>> >>>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and >>>> ready to provide service. >>>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp >>>> 06:59:20 corosync [MAIN ] Successfully read main configuration file >>>> '/etc/corosync/corosync.conf'. >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1 >>>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247 >>>> ms) >>>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20 >>>> retrans) >>>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500 >>>> ms) merge (200 ms) >>>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500 >>>> msgs) >>>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum >>>> network MTU 1402 >>>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum >>>> messages per rotation (20 messages) >>>> 06:59:20 corosync [TOTEM ] missed count const (5 messages) >>>> 06:59:20 corosync [TOTEM ] send threads (0 threads) >>>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms) >>>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms) >>>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count) >>>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count) >>>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms) >>>> 06:59:20 corosync [TOTEM ] RRP mode set to none. >>>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0) >>>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms) >>>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set >>>> heartbeat_failures_allowed > 0 >>>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast). >>>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security: >>>> libtomcrypt SOBER128/SHA1HMAC (mode 0). >>>> 06:59:20 corosync [IPC ] you are using ipc api v2 >>>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size >>>> (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size >>>> (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer >>>> size (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send >>>> buffer size (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up. >>>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59 >>>> for this ring. 
>>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log >>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup >>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log >>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster >>>> Manager 1.1.6 >>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup >>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster >>>> Manager 1.1.6 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended >>>> virtual synchrony service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration >>>> service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed >>>> process group service v1.01 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config >>>> database access v1.01 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading >>>> service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum >>>> service v0.1 >>>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 >>>> and V2 of the synchronization engine. >>>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15. >>>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep. >>>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c >>>> 06:59:20 corosync [TOTEM ] entering COMMIT state. >>>> 06:59:20 corosync [TOTEM ] got commit token >>>> 06:59:20 corosync [TOTEM ] entering RECOVERY state. >>>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59: >>>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59 >>>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1 >>>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in >>>> recovery. >>>> 06:59:20 corosync [TOTEM ] got commit token >>>> 06:59:20 corosync [TOTEM ] Sending initial ORF token >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 0, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 1, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 2, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 3, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 >>>> aru 0 0 >>>> 06:59:20 corosync [TOTEM ] Resetting old ring state >>>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0 >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1 >>>> 06:59:20 corosync [SYNC ] This node is within the primary component and >>>> will provide service. >>>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state. >>>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a >>>> new membership was formed. >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. 
>>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy >>>> CKPT service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync >>>> cluster closed process group service v1.01) >>>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ; >>>> members(old:0 left:0) >>>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; >>>> members(old:0 left:0) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync >>>> cluster closed process group service v1.01) >>>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to >>>> provide service. >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 15 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 17 >>>> 06:59:20node03lrmd: [14934]: info: enabling coredumps >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 10 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 12 >>>> 06:59:20node03lrmd: [14934]: debug: main: run the loop... >>>> 06:59:20node03lrmd: [14934]: info: Started. 
>>>> 06:59:20 [14935]node03 attrd: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14935]node03 attrd: info: main: Starting up >>>> 06:59:20 [14935]node03 attrd: info: get_cluster_type: Cluster >>>> type is: 'openais' >>>> 06:59:20 [14935]node03 attrd: notice: crm_cluster_connect: >>>> Connecting to cluster infrastructure: classic openais (with plugin) >>>> 06:59:20 [14936]node03 pengine: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic: >>>> Creating connection to our Corosync plugin >>>> 06:59:20 [14936]node03 pengine: debug: main: Checking for old >>>> instances of pengine >>>> 06:59:20 [14937]node03 crmd: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14936]node03 pengine: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/pengine >>>> 06:59:20 [14937]node03 crmd: notice: main: CRM Hg Version: >>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>> 06:59:20 [14936]node03 pengine: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/pengine >>>> 06:59:20 [14936]node03 pengine: debug: main: Init server comms >>>> 06:59:20 [14936]node03 pengine: info: main: Starting pengine >>>> 06:59:20 [14937]node03 crmd: debug: crmd_init: Starting crmd >>>> 06:59:20 [14937]node03 crmd: debug: s_crmd_fsa: Processing >>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ] >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_LOG >>>> 06:59:20 [14937]node03 crmd: debug: do_log: FSA: Input >>>> I_STARTUP from crmd_init() received in state S_STARTING >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_STARTUP >>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Registering >>>> Signal Handlers >>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Creating CIB >>>> and LRM objects >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_CIB_START >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/cib_rw >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/cib_rw >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to command channel failed >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/cib_callback >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/cib_callback >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to callback channel failed >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to CIB failed: connection failed >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signoff: Signing >>>> out of the CIB Service >>>> 06:59:20 [14935]node03 attrd: debug: init_ais_connection_classic: >>>> Adding fd=6 to mainloop >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic: >>>> AIS connection established >>>> 06:59:20 [14935]node03 attrd: info: get_ais_nodeid: Server >>>> details: id=1003428268 uname=node03 cname=pcmk >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_once: >>>> Connection to 'classic openais (with plugin)': 
established >>>> 06:59:20 [14935]node03 attrd: debug: crm_new_peer: Creating entry >>>> for node node03/1003428268 >>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Nodenode03now >>>> has id: 1003428268 >>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node 1003428268 >>>> is now known as node03 >>>> 06:59:20 [14935]node03 attrd: info: main: Cluster connection >>>> active >>>> 06:59:20 [14935]node03 attrd: info: main: Accepting attribute >>>> updates >>>> 06:59:20 [14935]node03 attrd: notice: main: Starting mainloop... >>>> 06:59:20 [14933]node03stonith-ng: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/root >>>> 06:59:20 [14933]node03stonith-ng: info: get_cluster_type: Cluster >>>> type is: 'openais' >>>> 06:59:20 [14933]node03stonith-ng: notice: crm_cluster_connect: >>>> Connecting to cluster infrastructure: classic openais (with plugin) >>>> 06:59:20 [14933]node03stonith-ng: info: init_ais_connection_classic: >>>> Creating connection to our Corosync plugin >>>> 06:59:20 [14932]node03 cib: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14932]node03 cib: info: retrieveCib: Reading cluster >>>> configuration from: /var/lib/heartbeat/crm/cib.xml (digest: >>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1" >>>> validate-with="pacemaker-1.2" crm_feature_set="3.0.6" >>>> update-origin="node03" update-client="crmd" cib-last-written="Tue Apr 9 >>>> 06:48:33 2013" have-quorum="1" > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <configuration > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <crm_config > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <cluster_property_set id="cib-bootstrap-options" > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair >>>> id="cib-bootstrap-options-default-resource-stickiness" >>>> name="default-resource-stickiness" value="1000" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-no-quorum-policy" >>>> name="no-quorum-policy" value="ignore" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-stonith-enabled" >>>> name="stonith-enabled" value="false" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-expected-quorum-votes" >>>> name="expected-quorum-votes" value="3" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-dc-version" >>>> name="dc-version" >>>> value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair >>>> id="cib-bootstrap-options-cluster-infrastructure" >>>> name="cluster-infrastructure" value="openais" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-last-lrm-refresh" >>>> name="last-lrm-refresh" value="1365160119" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] </cluster_property_set> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] </crm_config> >>>> … >>>> … >>>> ... >>>> >>>> >>>> We are still seeing the extra pacemaker daemons when corosync starts up. 
>>>> As an added check, all pacemaker daemons exited correctly when stopping
>>>> corosync.
>>>> lrmd attempts to start twice..
>>>>
>>>> ps aux | grep lrmd
>>>> root     16412  0.0  0.0      0     0 ?        Z    07:20   0:00 [lrmd] <defunct>
>>>> root     16419  0.0  0.0  34240  1052 ?        S    07:20   0:00 /usr/lib64/heartbeat/lrmd
>>>> root     21030  0.0  0.0 103244   856 pts/0    S+   08:37   0:00 grep lrmd
>>>>
>>>>
>>>> Help to resolve this issue appreciated..
>>>>
>>>> Cheers,
>>>> Jimmy.
>>>>
>>>>
>>>> On 9 Apr 2013, at 00:16, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>
>>>>>
>>>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> thanks for your reply, we are running at debug level with the following
>>>>>> config from corosync.conf
>>>>>>
>>>>>> logging {
>>>>>>         fileline: off
>>>>>>         to_syslog: yes
>>>>>>         to_stderr: no
>>>>>>         syslog_facility: daemon
>>>>>>         debug: on
>>>>>>         timestamp: on
>>>>>> }
>>>>>>
>>>>>> Looking at the issue further, there seems to be 2 instances of some
>>>>>> pacemaker daemons running on this particular node….
>>>>>>
>>>>>>
>>>>>> ps aux | grep pace
>>>>>>
>>>>>> 495    3050  0.2  0.0  89956  7184 ?  S  07:10  0:01 /usr/libexec/pacemaker/cib
>>>>>> root   3051  0.0  0.0  87128  3152 ?  S  07:10  0:00 /usr/libexec/pacemaker/stonithd
>>>>>> 495    3053  0.0  0.0  91188  2840 ?  S  07:10  0:00 /usr/libexec/pacemaker/attrd
>>>>>> 495    3054  0.0  0.0  87336  2484 ?  S  07:10  0:00 /usr/libexec/pacemaker/pengine
>>>>>> 495    3055  0.0  0.0  91332  3156 ?  S  07:10  0:00 /usr/libexec/pacemaker/crmd
>>>>>> 495    3057  0.0  0.0  88876  5224 ?  S  07:10  0:00 /usr/libexec/pacemaker/cib
>>>>>> root   3058  0.0  0.0  87128  3132 ?  S  07:10  0:00 /usr/libexec/pacemaker/stonithd
>>>>>> 495    3060  0.0  0.0  91188  2788 ?  S  07:10  0:00 /usr/libexec/pacemaker/attrd
>>>>>> 495    3062  0.0  0.0  91436  3932 ?  S  07:10  0:00 /usr/libexec/pacemaker/crmd
>>>>>>
>>>>>>
>>>>>> ps aux | grep corosync
>>>>>> root   3044  0.1  0.0 977852  9264 ?      Ssl  07:10  0:01 corosync
>>>>>> root   9363  0.0  0.0 103248   856 pts/0  S+   07:33  0:00 grep corosync
>>>>>>
>>>>>>
>>>>>> ps aux | grep lrmd
>>>>>> root   3052  0.0  0.0  76464  2528 ?  S  07:10  0:00 /usr/lib64/heartbeat/lrmd
>>>>>>
>>>>>>
>>>>>> Not sure why this is the case? Appreciate any help..
>>>>>>
>>>>>
>>>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>>>> "service pacemaker start" ?
>>>>>
>>>>>> Cheers,
>>>>>> Jimmy.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>
>>>>>>> This doesn't look promising:
>>>>>>>
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>>>> lrmd: [4939]: info: Started.
>>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>>
>>>>>>> The lrmd comes up but then immediately shuts down.
>>>>>>> Perhaps try enabling debug to see if that sheds any light.
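For anyone else hitting the same duplicate-daemon symptom, a quick way to check for this particular double-start condition on an EL6 box (a hedged sketch using standard tooling; adjust paths to your install):

    # Is the corosync plugin configured to spawn pacemaker's children itself (ver: 0)?
    grep -B2 -A3 'name: pacemaker' /etc/corosync/corosync.conf

    # Is the pacemaker init script *also* enabled or being run by hand?
    chkconfig --list pacemaker
    ps -ef | egrep 'pacemakerd|corosync' | grep -v grep

With "ver: 1" the plugin only provides membership/quorum and the daemons are started exactly once, by the init script, i.e.:

    service {
        name: pacemaker
        ver:  1
    }
    # then: service pacemaker start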
>>>>>>>
>>>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>>>> existing topic!
>>>>>>>>
>>>>>>>>
>>>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>>>> - 2 nodes are on the same switch
>>>>>>>> - 3rd node is on the same subnet but there's a router in between.
>>>>>>>> - IP multicast is enabled and has been tested using omping as follows..
>>>>>>>>
>>>>>>>> On each node ran..
>>>>>>>>
>>>>>>>> omping node01 node02 node03
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 3
>>>>>>>>
>>>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.128/0.181/0.255/0.025
>>>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.140/0.187/0.219/0.021
>>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.115/0.150/0.168/0.021
>>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.134/0.162/0.177/0.014
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 2
>>>>>>>>
>>>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.168/0.191/0.205/0.014
>>>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.112/0.149/0.175/0.022
>>>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 1
>>>>>>>>
>>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.154/0.185/0.208/0.019
>>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.175/0.198/0.214/0.015
>>>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.114/0.160/0.185/0.019
>>>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>>>
>>>>>>>>
>>>>>>>> - Problem is intermittent but frequent. Occasionally starts fine when
>>>>>>>>   started from scratch.
>>>>>>>>
>>>>>>>> We suspect the problem is related to node 3 as we can see lrmd
>>>>>>>> failures as per the attached log. We've checked permissions are ok as
>>>>>>>> per https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> stonith-ng[1437]:   error: ais_dispatch: AIS connection failed
>>>>>>>> stonith-ng[1437]:   error: stonith_peer_ais_destroy: AIS connection terminated
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: Pacemaker Cluster Manager 1.1.6
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync configuration service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync profile loading service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
>>>>>>>> corosync[1430]:   [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1894.
>>>>>>>> >>>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma >>>>>>>> snmp >>>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file >>>>>>>> '/etc/corosync/corosync.conf'. >>>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast). >>>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security: >>>>>>>> libtomcrypt SOBER128/SHA1HMAC (mode 0). >>>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now >>>>>>>> up. >>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster >>>>>>>> Manager 1.1.6 >>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster >>>>>>>> Manager 1.1.6 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended >>>>>>>> virtual synchrony service >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync >>>>>>>> configuration service >>>>>>>> orosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> closed process group service v1.01 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> config database access v1.01 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile >>>>>>>> loading service >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> quorum service v0.1 >>>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using >>>>>>>> V1 and V2 of the synchronization engine. >>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership >>>>>>>> and a new membership was formed. >>>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) >>>>>>>> ip(10.87.79.59) ; members(old:0 left:0) >>>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to >>>>>>>> provide service. 
>>>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from: >>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest: >>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context >>>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active >>>>>>>> directory to /var/lib/heartbeat/cores/root >>>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: >>>>>>>> 'openais' >>>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from: >>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest: >>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS >>>>>>>> connection established >>>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details: >>>>>>>> id=1003428268 uname=w0110Danmtapp03 cname=pcmk >>>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id: >>>>>>>> 1003428268 >>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known >>>>>>>> as node03 >>>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 15 >>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to >>>>>>>> exit >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 17 >>>>>>>> lrmd: [4939]: info: enabling coredumps >>>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active >>>>>>>> directory to /var/lib/heartbeat/cores/root >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 10 >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 12 >>>>>>>> lrmd: [4939]: info: Started. 
>>>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: >>>>>>>> 'openais' >>>>>>>> lrmd: [4939]: info: lrmd is shutting down >>>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> attrd[4940]: info: main: Starting up >>>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: info: main: Starting pengine >>>>>>>> crmd[4942]: notice: main: CRM Hg Version: >>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4948]: warning: main: Terminating previous PE instance >>>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: warning: process_pe_message: Received quit message, >>>>>>>> terminating >>>>>>>> attrd[4947]: info: main: Starting up >>>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> crmd[4949]: notice: main: CRM Hg Version: >>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS >>>>>>>> connection established >>>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details: >>>>>>>> id=1003428268 uname=node03 cname=pcmk >>>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id: >>>>>>>> 1003428268 >>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known >>>>>>>> as node03 >>>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> attrd[4940]: info: main: Cluster connection active >>>>>>>> attrd[4940]: info: main: Accepting attribute updates >>>>>>>> attrd[4940]: notice: main: Starting mainloop... 
>>>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> attrd[4947]: info: main: Cluster connection active >>>>>>>> attrd[4947]: info: main: Accepting attribute updates >>>>>>>> attrd[4947]: notice: main: Starting mainloop... >>>>>>>> cib[4937]: info: startCib: CIB Initialization completed >>>>>>>> successfully >>>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4944]: info: startCib: CIB Initialization completed >>>>>>>> successfully >>>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> cib[4937]: info: cib_init: Starting cib mainloop >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum >>>>>>>> still lost >>>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268 >>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) >>>>>>>> born=0 seen=6892 proc=00000000000000000000000000111312 (new) >>>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> cib[4944]: info: cib_init: Starting cib mainloop >>>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology >>>>>>>> changes >>>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> still lost >>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership >>>>>>>> and a new membership was formed. 
>>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836 >>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0 >>>>>>>> seen=6896 proc=00000000000000000000000000000000 >>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052 >>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0 >>>>>>>> seen=6896 proc=00000000000000000000000000000000 >>>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as >>>>>>>> node02 >>>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052 >>>>>>>> state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> retained >>>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as >>>>>>>> node01 >>>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836 >>>>>>>> state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to >>>>>>>> rate-limiting >>>>>>>> crmd[4942]: info: do_cib_control: CIB connection established >>>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not >>>>>>>> applied to 1.249.0: current "num_updates" is less than required >>>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from >>>>>>>> peer >>>>>>>> crmd[4949]: info: do_cib_control: CIB connection established >>>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology >>>>>>>> changes >>>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop >>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff >>>>>>>> 1.249.29 -> 1.249.30 (sync in progress) >>>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> crmd[4949]: info: 
get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836 >>>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as >>>>>>>> node01 >>>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster >>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 >>>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052 >>>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as >>>>>>>> node02 >>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now >>>>>>>> unknown >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 >>>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client >>>>>>>> node03/crmd now has status [online] (DC=<null>) >>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268 >>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) >>>>>>>> born=6896 seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> retained >>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff >>>>>>>> 1.249.30 -> 1.249.31 (sync in progress) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to >>>>>>>> rate-limiting >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5 >>>>>>>> (30 max) times >>>>>>>> pengine[4948]: info: main: Starting pengine >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign 
on to the LRM 3 >>>>>>>> (30 max) times >>>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 >>>>>>>> signon attempts >>>>>>>> attrd[4940]: info: cib_connect: Sending full refresh >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7 >>>>>>>> (30 max) times >>>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 >>>>>>>> signon attempts >>>>>>>> attrd[4947]: info: cib_connect: Sending full refresh >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 4 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 8 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 5 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 9 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 6 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 10 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 7 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 11 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 8 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 12 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 9 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 13 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 10 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 14 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 11 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer 
(I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 12 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 15 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 13 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 16 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 14 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 17 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 15 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 18 >>>>>>>> (30 max) times >>>>>>>> >>>>>>>> >>>>>>>> We have the following components installed.. >>>>>>>> >>>>>>>> >>>>>>>> corosynclib-1.4.1-15.el6.x86_64 >>>>>>>> corosync-1.4.1-15.el6.x86_64 >>>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64 >>>>>>>> clusterlib-3.0.12.1-49.el6.x86_64 >>>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64 >>>>>>>> cluster-glue-1.0.5-6.el6.x86_64 >>>>>>>> resource-agents-3.9.2-12.el6.x86_64 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> We'd appreciate assistance on how to debug what the issue may be and >>>>>>>> some possible causes. 
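Coming back to the permissions check mentioned earlier (the launchpad cluster-glue bug): one quick, purely illustrative way to re-verify is to look at the directories that appear in these logs. On a stock EL6 Pacemaker install the crm state directory and cib.xml are normally owned by hacluster:haclient, but confirm the expected ownership against your own packages:

    ls -ld /var/lib/heartbeat /var/lib/heartbeat/crm /var/lib/heartbeat/cores /var/run/crm
    ls -l /var/lib/heartbeat/crm/cib.xml*
    id hacluster        # the unprivileged user most of the pacemaker daemons run as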
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jimmy

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems