On 16/04/2013, at 7:39 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
> Hi Andrew,
>
> Thanks for your reply, we tried that option but to no avail.
>
> To resolve the issue, what worked for us was to remove the existing HA packages
> and update pacemaker to 1.1.8-7.

Just moving to cman (without updating the packages) would/should have also worked. But upgrading was probably a good idea anyway :)

> Here is the procedure…
>
> 1. Back up /etc/corosync/corosync.conf and /etc/corosync/authkey.
> 2. Export cib.xml:
>        cibadmin -Q > /tmp/ha_backup/cib.xml
> 3. Stop the corosync service on all nodes.
> 4. Remove the existing HA packages:
>        yum -y remove pacemaker corosync heartbeat resource-agents \
>            cluster-glue rgmanager lvm2-cluster gfs2-utils
> 5. Install the updated HA packages:
>        yum -y install pacemaker cman ccs resource-agents
>
>    resulting in the following packages being installed:
>        pacemaker-doc-1.1.8-7.el6.x86_64
>        pacemaker-cli-1.1.8-7.el6.x86_64
>        pacemaker-libs-1.1.8-7.el6.x86_64
>        pacemaker-cts-1.1.8-7.el6.x86_64
>        pacemaker-libs-devel-1.1.8-7.el6.x86_64
>        pacemaker-cluster-libs-1.1.8-7.el6.x86_64
>        pacemaker-1.1.8-7.el6.x86_64
>        pacemaker-debuginfo-1.1.8-7.el6.x86_64
>        cman-3.0.12.1-49.el6.x86_64
>        ccs-0.16.2-55.el6.x86_64
>        resource-agents-3.9.2-12.el6.x86_64
>        cluster-glue-libs-1.0.5-6.el6.x86_64
>        corosync-1.4.1-15.el6.x86_64
>        corosynclib-1.4.1-15.el6.x86_64
>        corosync-debuginfo-1.4.1-15.el6.x86_64
>        corosynclib-devel-1.4.1-15.el6.x86_64
>
> 6. Get the crmsh package and install it:
>        yum -y install crmsh*
> 7. Start the ricci service:
>        service ricci start
>    Also ensure it starts on boot:
>        chkconfig --add ricci
> 8. Set the ricci password:
>        passwd ricci
> 9. Configure the cluster:
>        ccs -f /etc/cluster/cluster.conf --createcluster testprod -i
>        ccs -f /etc/cluster/cluster.conf --addnode node01
>        ccs -f /etc/cluster/cluster.conf --addnode node02
>        ccs -f /etc/cluster/cluster.conf --addnode node03
>        ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node01
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node02
>        ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node03
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node01 pcmk-redirect port=1
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node02 pcmk-redirect port=2
>        ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node03 pcmk-redirect port=3
>        ccs -f /etc/cluster/cluster.conf --setlogging debug=on
>        ccs -f /etc/cluster/cluster.conf --settotem
> 10. Distribute the cluster.conf:
>        ccs -h node01 -p ************** --sync --activate
> 11. Set the CMAN quorum timeout to 0 on all three nodes separately:
>        echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
> 12. Start the services on each node:
>        service cman start
>        service pacemaker start
>     Also ensure they start on boot:
>        chkconfig --add cman
>        chkconfig --add pacemaker
>
>
> Best of luck,
> Jimmy.
>
>
>
>
> On 12 Apr 2013, at 02:11, Andrew Beekhof <and...@beekhof.net> wrote:
>
>>
>> On 11/04/2013, at 6:05 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>
>>> Hi,
>>>
>>> Following up on the above thread, any thoughts as to what may be causing
>>> the issue..
>>
>> One of the main reasons pacemakerd was created was to avoid weirdness around
>> the starting of pacemaker's child processes from within a multi-threaded
>> application like corosync... which is almost certainly what you're bumping
>> into here.
>>
>> Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
>> to rule out any other causes?
>>
>>>
>>> Cheers,
>>> Jimmy.
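A note for the archives on the procedure above: step 2 saves the old CIB to /tmp/ha_backup/cib.xml, but no later step re-imports it. If the intent is to carry the old resource configuration across, a rough sketch of what that could look like once step 12 has completed on all nodes (hedged, not part of Jimmy's procedure; the path is just the one from step 2, and the exported CIB was written by 1.1.7 and still records cluster-infrastructure=openais, so review it before pushing it back):

    ccs_config_validate                                      # sanity-check the generated cluster.conf
    crm_mon -1                                               # confirm all three nodes show up under cman
    cibadmin --replace --xml-file /tmp/ha_backup/cib.xml     # re-import the saved configuration
    crm_verify -L                                            # validate the live CIB afterwards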
>>> >>> >>> >>> On 9 Apr 2013, at 13:39, Jimmy Magee <jimmy.ma...@vennetics.com> wrote: >>> >>>> Hi Andrew, >>>> >>>> The corosync.conf is configured as follows: >>>> >>>> >>>>> service { >>>>> # Load the Pacemaker Cluster Resource Manager >>>>> name: pacemaker >>>>> ver: 0 >>>>> } >>>> >>>> >>>> >>>> and pacemaker is not started via service pacemaker start… >>>> >>>> here is the extract from the logs with extra debug when attempting to >>>> start corosync/pacemaker.. >>>> >>>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and >>>> ready to provide service. >>>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp >>>> 06:59:20 corosync [MAIN ] Successfully read main configuration file >>>> '/etc/corosync/corosync.conf'. >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1 >>>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247 >>>> ms) >>>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20 >>>> retrans) >>>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500 >>>> ms) merge (200 ms) >>>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500 >>>> msgs) >>>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum >>>> network MTU 1402 >>>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum >>>> messages per rotation (20 messages) >>>> 06:59:20 corosync [TOTEM ] missed count const (5 messages) >>>> 06:59:20 corosync [TOTEM ] send threads (0 threads) >>>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms) >>>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms) >>>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count) >>>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count) >>>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms) >>>> 06:59:20 corosync [TOTEM ] RRP mode set to none. >>>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0) >>>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms) >>>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set >>>> heartbeat_failures_allowed > 0 >>>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast). >>>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security: >>>> libtomcrypt SOBER128/SHA1HMAC (mode 0). >>>> 06:59:20 corosync [IPC ] you are using ipc api v2 >>>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size >>>> (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size >>>> (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer >>>> size (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send >>>> buffer size (320000 bytes). >>>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up. >>>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59 >>>> for this ring. 
>>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log >>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup >>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log >>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster >>>> Manager 1.1.6 >>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup >>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster >>>> Manager 1.1.6 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended >>>> virtual synchrony service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration >>>> service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed >>>> process group service v1.01 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config >>>> database access v1.01 >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading >>>> service >>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum >>>> service v0.1 >>>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 >>>> and V2 of the synchronization engine. >>>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15. >>>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep. >>>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c >>>> 06:59:20 corosync [TOTEM ] entering COMMIT state. >>>> 06:59:20 corosync [TOTEM ] got commit token >>>> 06:59:20 corosync [TOTEM ] entering RECOVERY state. >>>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59: >>>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59 >>>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1 >>>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in >>>> recovery. >>>> 06:59:20 corosync [TOTEM ] got commit token >>>> 06:59:20 corosync [TOTEM ] Sending initial ORF token >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 0, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 1, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 2, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 >>>> retrans queue empty 1 count 3, aru 0 >>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >>>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 >>>> aru 0 0 >>>> 06:59:20 corosync [TOTEM ] Resetting old ring state >>>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0 >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1 >>>> 06:59:20 corosync [SYNC ] This node is within the primary component and >>>> will provide service. >>>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state. >>>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a >>>> new membership was formed. >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. 
>>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy >>>> CKPT service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT >>>> service) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT >>>> service) >>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync >>>> cluster closed process group service v1.01) >>>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ; >>>> members(old:0 left:0) >>>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; >>>> members(old:0 left:0) >>>> 06:59:20 corosync [SYNC ] confchg entries 1 >>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268 >>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 >>>> = 1. >>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed >>>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync >>>> cluster closed process group service v1.01) >>>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to >>>> provide service. >>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 15 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 17 >>>> 06:59:20node03lrmd: [14934]: info: enabling coredumps >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 10 >>>> 06:59:20node03lrmd: [14934]: info: G_main_add_SignalHandler: Added signal >>>> handler for signal 12 >>>> 06:59:20node03lrmd: [14934]: debug: main: run the loop... >>>> 06:59:20node03lrmd: [14934]: info: Started. 
>>>> 06:59:20 [14935]node03 attrd: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14935]node03 attrd: info: main: Starting up >>>> 06:59:20 [14935]node03 attrd: info: get_cluster_type: Cluster >>>> type is: 'openais' >>>> 06:59:20 [14935]node03 attrd: notice: crm_cluster_connect: >>>> Connecting to cluster infrastructure: classic openais (with plugin) >>>> 06:59:20 [14936]node03 pengine: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic: >>>> Creating connection to our Corosync plugin >>>> 06:59:20 [14936]node03 pengine: debug: main: Checking for old >>>> instances of pengine >>>> 06:59:20 [14937]node03 crmd: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14936]node03 pengine: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/pengine >>>> 06:59:20 [14937]node03 crmd: notice: main: CRM Hg Version: >>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>> 06:59:20 [14936]node03 pengine: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/pengine >>>> 06:59:20 [14936]node03 pengine: debug: main: Init server comms >>>> 06:59:20 [14936]node03 pengine: info: main: Starting pengine >>>> 06:59:20 [14937]node03 crmd: debug: crmd_init: Starting crmd >>>> 06:59:20 [14937]node03 crmd: debug: s_crmd_fsa: Processing >>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ] >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_LOG >>>> 06:59:20 [14937]node03 crmd: debug: do_log: FSA: Input >>>> I_STARTUP from crmd_init() received in state S_STARTING >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_STARTUP >>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Registering >>>> Signal Handlers >>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Creating CIB >>>> and LRM objects >>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: >>>> actions:trace: // A_CIB_START >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/cib_rw >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/cib_rw >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to command channel failed >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/cib_callback >>>> 06:59:20 [14937]node03 crmd: debug: >>>> init_client_ipc_comms_nodispatch: Could not init comms on: >>>> /var/run/crm/cib_callback >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to callback channel failed >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw: >>>> Connection to CIB failed: connection failed >>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signoff: Signing >>>> out of the CIB Service >>>> 06:59:20 [14935]node03 attrd: debug: init_ais_connection_classic: >>>> Adding fd=6 to mainloop >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic: >>>> AIS connection established >>>> 06:59:20 [14935]node03 attrd: info: get_ais_nodeid: Server >>>> details: id=1003428268 uname=node03 cname=pcmk >>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_once: >>>> Connection to 'classic openais (with plugin)': 
established >>>> 06:59:20 [14935]node03 attrd: debug: crm_new_peer: Creating entry >>>> for node node03/1003428268 >>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Nodenode03now >>>> has id: 1003428268 >>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node 1003428268 >>>> is now known as node03 >>>> 06:59:20 [14935]node03 attrd: info: main: Cluster connection >>>> active >>>> 06:59:20 [14935]node03 attrd: info: main: Accepting attribute >>>> updates >>>> 06:59:20 [14935]node03 attrd: notice: main: Starting mainloop... >>>> 06:59:20 [14933]node03stonith-ng: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/root >>>> 06:59:20 [14933]node03stonith-ng: info: get_cluster_type: Cluster >>>> type is: 'openais' >>>> 06:59:20 [14933]node03stonith-ng: notice: crm_cluster_connect: >>>> Connecting to cluster infrastructure: classic openais (with plugin) >>>> 06:59:20 [14933]node03stonith-ng: info: init_ais_connection_classic: >>>> Creating connection to our Corosync plugin >>>> 06:59:20 [14932]node03 cib: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> 06:59:20 [14932]node03 cib: info: retrieveCib: Reading cluster >>>> configuration from: /var/lib/heartbeat/crm/cib.xml (digest: >>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1" >>>> validate-with="pacemaker-1.2" crm_feature_set="3.0.6" >>>> update-origin="node03" update-client="crmd" cib-last-written="Tue Apr 9 >>>> 06:48:33 2013" have-quorum="1" > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <configuration > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <crm_config > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <cluster_property_set id="cib-bootstrap-options" > >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair >>>> id="cib-bootstrap-options-default-resource-stickiness" >>>> name="default-resource-stickiness" value="1000" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-no-quorum-policy" >>>> name="no-quorum-policy" value="ignore" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-stonith-enabled" >>>> name="stonith-enabled" value="false" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-expected-quorum-votes" >>>> name="expected-quorum-votes" value="3" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-dc-version" >>>> name="dc-version" >>>> value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair >>>> id="cib-bootstrap-options-cluster-infrastructure" >>>> name="cluster-infrastructure" value="openais" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] <nvpair id="cib-bootstrap-options-last-lrm-refresh" >>>> name="last-lrm-refresh" value="1365160119" /> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] </cluster_property_set> >>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile: >>>> [on-disk] </crm_config> >>>> … >>>> … >>>> ... >>>> >>>> >>>> We are still seeing the extra pacemaker daemons when corosync starts up. 
>>>> As an added check, all pacemaker daemons exited correctly when stopping
>>>> corosync.
>>>> lrmd attempts to start twice..
>>>>
>>>> ps aux | grep lrmd
>>>> root     16412  0.0  0.0      0     0 ?        Z    07:20   0:00 [lrmd] <defunct>
>>>> root     16419  0.0  0.0  34240  1052 ?        S    07:20   0:00 /usr/lib64/heartbeat/lrmd
>>>> root     21030  0.0  0.0 103244   856 pts/0    S+   08:37   0:00 grep lrmd
>>>>
>>>>
>>>> Help to resolve this issue appreciated..
>>>>
>>>> Cheers,
>>>> Jimmy.
>>>>
>>>>
>>>> On 9 Apr 2013, at 00:16, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>
>>>>>
>>>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> thanks for your reply, we are running at debug level with the following
>>>>>> config from corosync.conf
>>>>>>
>>>>>> logging {
>>>>>>         fileline: off
>>>>>>         to_syslog: yes
>>>>>>         to_stderr: no
>>>>>>         syslog_facility: daemon
>>>>>>         debug: on
>>>>>>         timestamp: on
>>>>>> }
>>>>>>
>>>>>> Looking at the issue further, there seems to be 2 instances of some
>>>>>> pacemaker daemons running on this particular node….
>>>>>>
>>>>>>
>>>>>> ps aux | grep pace
>>>>>>
>>>>>> 495    3050  0.2  0.0  89956  7184 ?  S  07:10  0:01 /usr/libexec/pacemaker/cib
>>>>>> root   3051  0.0  0.0  87128  3152 ?  S  07:10  0:00 /usr/libexec/pacemaker/stonithd
>>>>>> 495    3053  0.0  0.0  91188  2840 ?  S  07:10  0:00 /usr/libexec/pacemaker/attrd
>>>>>> 495    3054  0.0  0.0  87336  2484 ?  S  07:10  0:00 /usr/libexec/pacemaker/pengine
>>>>>> 495    3055  0.0  0.0  91332  3156 ?  S  07:10  0:00 /usr/libexec/pacemaker/crmd
>>>>>> 495    3057  0.0  0.0  88876  5224 ?  S  07:10  0:00 /usr/libexec/pacemaker/cib
>>>>>> root   3058  0.0  0.0  87128  3132 ?  S  07:10  0:00 /usr/libexec/pacemaker/stonithd
>>>>>> 495    3060  0.0  0.0  91188  2788 ?  S  07:10  0:00 /usr/libexec/pacemaker/attrd
>>>>>> 495    3062  0.0  0.0  91436  3932 ?  S  07:10  0:00 /usr/libexec/pacemaker/crmd
>>>>>>
>>>>>>
>>>>>> ps aux | grep corosync
>>>>>> root   3044  0.1  0.0 977852  9264 ?      Ssl  07:10  0:01 corosync
>>>>>> root   9363  0.0  0.0 103248   856 pts/0  S+   07:33  0:00 grep corosync
>>>>>>
>>>>>>
>>>>>> ps aux | grep lrmd
>>>>>> root   3052  0.0  0.0  76464  2528 ?  S  07:10  0:00 /usr/lib64/heartbeat/lrmd
>>>>>>
>>>>>>
>>>>>> Not sure why this is the case? Appreciate any help..
>>>>>>
>>>>>
>>>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>>>> "service pacemaker start" ?
>>>>>
>>>>>> Cheers,
>>>>>> Jimmy.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>
>>>>>>> This doesn't look promising:
>>>>>>>
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>>>> lrmd: [4939]: info: Started.
>>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>>
>>>>>>> The lrmd comes up but then immediately shuts down.
>>>>>>> Perhaps try enabling debug to see if that sheds any light.
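For anyone else hitting the same duplicate-daemon symptom, a quick way to check for this particular double-start condition on an EL6 box (a hedged sketch using standard tooling; adjust paths to your install):

    # Is the corosync plugin configured to spawn pacemaker's children itself (ver: 0)?
    grep -B2 -A3 'name: pacemaker' /etc/corosync/corosync.conf

    # Is the pacemaker init script *also* enabled or being run by hand?
    chkconfig --list pacemaker
    ps -ef | egrep 'pacemakerd|corosync' | grep -v grep

With "ver: 1" the plugin only provides membership/quorum and the daemons are started exactly once, by the init script, i.e.:

    service {
        name: pacemaker
        ver:  1
    }
    # then: service pacemaker start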
>>>>>>>
>>>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <jimmy.ma...@vennetics.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>>>> existing topic!
>>>>>>>>
>>>>>>>>
>>>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>>>> - 2 nodes are on the same switch
>>>>>>>> - 3rd node is on the same subnet but there's a router in between.
>>>>>>>> - IP multicast is enabled and has been tested using omping as follows..
>>>>>>>>
>>>>>>>> On each node ran..
>>>>>>>>
>>>>>>>> omping node01 node02 node03
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 3
>>>>>>>>
>>>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.128/0.181/0.255/0.025
>>>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.140/0.187/0.219/0.021
>>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.115/0.150/0.168/0.021
>>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.134/0.162/0.177/0.014
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 2
>>>>>>>>
>>>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.168/0.191/0.205/0.014
>>>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.112/0.149/0.175/0.022
>>>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>>>
>>>>>>>>
>>>>>>>> On node 1
>>>>>>>>
>>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.154/0.185/0.208/0.019
>>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.175/0.198/0.214/0.015
>>>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.114/0.160/0.185/0.019
>>>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>>>
>>>>>>>>
>>>>>>>> - Problem is intermittent but frequent. Occasionally starts fine when
>>>>>>>>   started from scratch.
>>>>>>>>
>>>>>>>> We suspect the problem is related to node 3 as we can see lrmd
>>>>>>>> failures as per the attached log. We've checked permissions are ok as
>>>>>>>> per https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> stonith-ng[1437]:   error: ais_dispatch: AIS connection failed
>>>>>>>> stonith-ng[1437]:   error: stonith_peer_ais_destroy: AIS connection terminated
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: Pacemaker Cluster Manager 1.1.6
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync configuration service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync profile loading service
>>>>>>>> corosync[1430]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
>>>>>>>> corosync[1430]:   [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1894.
>>>>>>>> >>>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma >>>>>>>> snmp >>>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file >>>>>>>> '/etc/corosync/corosync.conf'. >>>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast). >>>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security: >>>>>>>> libtomcrypt SOBER128/SHA1HMAC (mode 0). >>>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now >>>>>>>> up. >>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster >>>>>>>> Manager 1.1.6 >>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster >>>>>>>> Manager 1.1.6 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended >>>>>>>> virtual synchrony service >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync >>>>>>>> configuration service >>>>>>>> orosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> closed process group service v1.01 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> config database access v1.01 >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile >>>>>>>> loading service >>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster >>>>>>>> quorum service v0.1 >>>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using >>>>>>>> V1 and V2 of the synchronization engine. >>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership >>>>>>>> and a new membership was formed. >>>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) >>>>>>>> ip(10.87.79.59) ; members(old:0 left:0) >>>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to >>>>>>>> provide service. 
>>>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from: >>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest: >>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context >>>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active >>>>>>>> directory to /var/lib/heartbeat/cores/root >>>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: >>>>>>>> 'openais' >>>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from: >>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest: >>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS >>>>>>>> connection established >>>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details: >>>>>>>> id=1003428268 uname=w0110Danmtapp03 cname=pcmk >>>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id: >>>>>>>> 1003428268 >>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known >>>>>>>> as node03 >>>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 15 >>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to >>>>>>>> exit >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 17 >>>>>>>> lrmd: [4939]: info: enabling coredumps >>>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active >>>>>>>> directory to /var/lib/heartbeat/cores/root >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 10 >>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for >>>>>>>> signal 12 >>>>>>>> lrmd: [4939]: info: Started. 
>>>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: >>>>>>>> 'openais' >>>>>>>> lrmd: [4939]: info: lrmd is shutting down >>>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> attrd[4940]: info: main: Starting up >>>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: info: main: Starting pengine >>>>>>>> crmd[4942]: notice: main: CRM Hg Version: >>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4948]: warning: main: Terminating previous PE instance >>>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory >>>>>>>> to /var/lib/heartbeat/cores/hacluster >>>>>>>> pengine[4941]: warning: process_pe_message: Received quit message, >>>>>>>> terminating >>>>>>>> attrd[4947]: info: main: Starting up >>>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating >>>>>>>> connection to our Corosync plugin >>>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to >>>>>>>> /var/lib/heartbeat/cores/hacluster >>>>>>>> crmd[4949]: notice: main: CRM Hg Version: >>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14 >>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS >>>>>>>> connection established >>>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details: >>>>>>>> id=1003428268 uname=node03 cname=pcmk >>>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id: >>>>>>>> 1003428268 >>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known >>>>>>>> as node03 >>>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> attrd[4940]: info: main: Cluster connection active >>>>>>>> attrd[4940]: info: main: Accepting attribute updates >>>>>>>> attrd[4940]: notice: main: Starting mainloop... 
>>>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to >>>>>>>> 'classic openais (with plugin)': established >>>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> attrd[4947]: info: main: Cluster connection active >>>>>>>> attrd[4947]: info: main: Accepting attribute updates >>>>>>>> attrd[4947]: notice: main: Starting mainloop... >>>>>>>> cib[4937]: info: startCib: CIB Initialization completed >>>>>>>> successfully >>>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4944]: info: startCib: CIB Initialization completed >>>>>>>> successfully >>>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> cib[4937]: info: cib_init: Starting cib mainloop >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum >>>>>>>> still lost >>>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268 >>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) >>>>>>>> born=0 seen=6892 proc=00000000000000000000000000111312 (new) >>>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> cib[4944]: info: cib_init: Starting cib mainloop >>>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology >>>>>>>> changes >>>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> still lost >>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership >>>>>>>> and a new membership was formed. 
>>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836 >>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0 >>>>>>>> seen=6896 proc=00000000000000000000000000000000 >>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052 >>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0 >>>>>>>> seen=6896 proc=00000000000000000000000000000000 >>>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as >>>>>>>> node02 >>>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052 >>>>>>>> state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> retained >>>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as >>>>>>>> node01 >>>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836 >>>>>>>> state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to >>>>>>>> rate-limiting >>>>>>>> crmd[4942]: info: do_cib_control: CIB connection established >>>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not >>>>>>>> applied to 1.249.0: current "num_updates" is less than required >>>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from >>>>>>>> peer >>>>>>>> crmd[4949]: info: do_cib_control: CIB connection established >>>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais' >>>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster >>>>>>>> infrastructure: classic openais (with plugin) >>>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection >>>>>>>> to our Corosync plugin >>>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology >>>>>>>> changes >>>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop >>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff >>>>>>>> 1.249.29 -> 1.249.30 (sync in progress) >>>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection >>>>>>>> established >>>>>>>> crmd[4949]: info: 
get_ais_nodeid: Server details: id=1003428268 >>>>>>>> uname=node03 cname=pcmk >>>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic >>>>>>>> openais (with plugin)': established >>>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268 >>>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as >>>>>>>> node03 >>>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836 >>>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as >>>>>>>> node01 >>>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster >>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now >>>>>>>> unknown >>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 >>>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052 >>>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as >>>>>>>> node02 >>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now >>>>>>>> unknown >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052 >>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812 >>>>>>>> seen=6896 proc=00000000000000000000000000111312 >>>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client >>>>>>>> node03/crmd now has status [online] (DC=<null>) >>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now >>>>>>>> member (was unknown) >>>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268 >>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) >>>>>>>> born=6896 seen=6896 proc=00000000000000000000000000111312 (new) >>>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum >>>>>>>> retained >>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff >>>>>>>> 1.249.30 -> 1.249.31 (sync in progress) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum >>>>>>>> acquired >>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to >>>>>>>> rate-limiting >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5 >>>>>>>> (30 max) times >>>>>>>> pengine[4948]: info: main: Starting pengine >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign 
on to the LRM 3 >>>>>>>> (30 max) times >>>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 >>>>>>>> signon attempts >>>>>>>> attrd[4940]: info: cib_connect: Sending full refresh >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7 >>>>>>>> (30 max) times >>>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 >>>>>>>> signon attempts >>>>>>>> attrd[4947]: info: cib_connect: Sending full refresh >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 4 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 8 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 5 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 9 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 6 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 10 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 7 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 11 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 8 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 12 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 9 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 13 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 10 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 14 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 11 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer 
(I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 12 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 15 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 13 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 16 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 14 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 17 >>>>>>>> (30 max) times >>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 15 >>>>>>>> (30 max) times >>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just >>>>>>>> popped (2000ms) >>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 18 >>>>>>>> (30 max) times >>>>>>>> >>>>>>>> >>>>>>>> We have the following components installed.. >>>>>>>> >>>>>>>> >>>>>>>> corosynclib-1.4.1-15.el6.x86_64 >>>>>>>> corosync-1.4.1-15.el6.x86_64 >>>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64 >>>>>>>> clusterlib-3.0.12.1-49.el6.x86_64 >>>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64 >>>>>>>> cluster-glue-1.0.5-6.el6.x86_64 >>>>>>>> resource-agents-3.9.2-12.el6.x86_64 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> We'd appreciate assistance on how to debug what the issue may be and >>>>>>>> some possible causes. 
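Coming back to the permissions check mentioned earlier (the launchpad cluster-glue bug): one quick, purely illustrative way to re-verify is to look at the directories that appear in these logs. On a stock EL6 Pacemaker install the crm state directory and cib.xml are normally owned by hacluster:haclient, but confirm the expected ownership against your own packages:

    ls -ld /var/lib/heartbeat /var/lib/heartbeat/crm /var/lib/heartbeat/cores /var/run/crm
    ls -l /var/lib/heartbeat/crm/cib.xml*
    id hacluster        # the unprivileged user most of the pacemaker daemons run as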
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jimmy

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems