Hello, I have a problem with a Pacemaker cluster (3 nodes, SAP production environment).
Node 1:

Feb 11 12:00:39 s-xxx-05 lrmd: [12995]: info: operation monitor[85] on ip_wd_WIC_pri for client 12998: pid 27282 exited with return code 0
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 crmd: [12998]: info: process_lrm_event: LRM operation ipbck_wd_WIC_pri_monitor_10000 (call=87, rc=7, cib-update=105, confirmed=false) not running
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ipbck_wd_WIC_pri (1)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent update 28: fail-count-ipbck_wd_WIC_pri=1
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ipbck_wd_WIC_pri (1392116476)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent update 31: last-failure-ipbck_wd_WIC_pri=1392116476
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on operation monitor[14] on usrsap_WBW_pri:2 for client 12998, its parameters: CRM_meta_record_pending=[false] CRM_meta_clone=[2] fstype=[ocfs2] device=[/dev/sapBWPvg/sapWBW] CRM_meta_clone_node_max=[1] CRM_meta_notify=[false] CRM_meta_clone_max=[3] CRM_meta_globally_unique=[false] crm_feature_set=[3.0.6] directory=[/usr/sap/WBW] CRM_meta_name=[monitor] CRM_meta_interval=[60000] CRM_meta_timeout=[60000]
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on operation stop[95] on webdisp_WIC_pri for client 12998, its parameters: CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_record_pending=[false] CRM_meta_timeout=[300000] InstanceName=[WIC_W39_vsicpwd] START_PROFILE=[/sapmnt/WIC/profile/WIC_W39_vsicpwd]

Node 2:

Feb 11 12:00:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: Transition 3196: PEngine Input stored in: /var/lib/pengine/pe-input-476.bz2
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: process_graph_event: Detected action ipbck_wd_WIC_pri_monitor_10000 from a different transition: 2546 vs. 3196
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ipbck_wd_WIC_pri_last_failure_0, magic=0:7;321:2546:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.325) : Old event
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating failcount for ipbck_wd_WIC_pri on s-xxx-05 after failed monitor: rc=7 (update=value++, time=1392116476)
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-fail-count-ipbck_wd_WIC_pri, name=fail-count-ipbck_wd_WIC_pri, value=1, magic=NA, cib=0.1910.326) : Transient attribute: update
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-last-failure-ipbck_wd_WIC_pri, name=last-failure-ipbck_wd_WIC_pri, value=1392116476, magic=NA, cib=0.1910.327) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ipbck_wd_WIC_pri can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Recover ipbck_wd_WIC_pri (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart ascs_ICP_pri (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart webdisp_WIC_pri (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 3197 (ref=pe_calc-dc-1392116477-4106) derived from /var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: te_rsc_command: Initiating action 414: stop webdisp_WIC_pri_stop_0 on s-xxx-05
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: status_from_rc: Action 414 (webdisp_WIC_pri_stop_0) on s-xxx-05 failed (target: 0 vs. rc: -2): Error
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating failcount for webdisp_WIC_pri on s-xxx-05 after failed stop: rc=-2 (update=INFINITY, time=1392116477)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=webdisp_WIC_pri_last_failure_0, magic=4:-2;414:3197:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.328) : Event failed
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: run_graph: ==== Transition 3197 (Complete=2, Pending=0, Fired=0, Skipped=11, Incomplete=0, Source=/var/lib/pengine/pe-input-477.bz2): Stopped
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-fail-count-webdisp_WIC_pri, name=fail-count-webdisp_WIC_pri, value=INFINITY, magic=NA, cib=0.1910.329) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-last-failure-webdisp_WIC_pri, name=last-failure-webdisp_WIC_pri, value=1392116477, magic=NA, cib=0.1910.330) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: Transition 3197: PEngine Input stored in: /var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op webdisp_WIC_pri_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: pe_fence_node: Node s-xxx-05 will be fenced to recover from resource failure(s)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
. .
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move ipbck_wd_WIC_pri (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move ascs_ICP_pri (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move webdisp_WIC_pri (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 3198 (ref=pe_calc-dc-1392116477-4108) derived from /var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: te_fence_node: Executing reboot fencing operation (464) on s-xxx-05 (timeout=12000)
Feb 11 12:01:17 s-xxx-06 stonith-ng: [10335]: info: initiate_remote_stonith_op: Initiating remote operation reboot for s-xxx-05: fff269bd-70f1-490b-a46f-92f2eaaa04f1
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: WARN: process_pe_message: Transition 3198: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: Refreshing port list for stonith-sbd_pri
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: call_remote_stonith: Requesting that s-xxx-06 perform op reboot s-xxx-05
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_fence: Found 1 matching devices for 's-xxx-05'
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_command: Processed st_fence from s-xxx-06: rc=-1
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Delivery process handling /dev/mapper/SBD_LUN_QUORUM
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Writing reset to node slot s-xxx-05

Node 3:

Feb 11 12:00:01 s-xxx-04 /usr/sbin/cron[22525]: (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 -S ALL 1 1)
Feb 11 12:00:01 s-xxx-04 syslog-ng[4795]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=11361', processed='center(received)=6355', processed='destination(messages)=1462', processed='destination(mailinfo)=4893', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=103', processed='destination(console)=5', processed='destination(null)=0', processed='destination(mail)=4893', processed='destination(xconsole)=5', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=6355'
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node s-xxx-06 now has id: 101344266
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node 101344266 is now known as s-xxx-06
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: stonith_command: Processed st_query from s-xxx-06: rc=0
Feb 11 12:01:23 s-xxx-04 corosync[12944]: [TOTEM ] A processor failed, forming new configuration.
Feb 11 12:01:29 s-xxx-04 corosync[12944]: [CLM ] CLM CONFIGURATION CHANGE

Does this "Cannot allocate memory" error indicate that no memory could be allocated to fork a new resource agent instance? The server has 128 GB of RAM, and THP is set to "never".

Versions:
openais-1.1.4-5.8.7.1
libopenais3-1.1.4-5.8.7.1
pacemaker-mgmt-2.1.1-0.6.2.17
pacemaker-1.1.7-0.13.9
drbd-pacemaker-8.4.2-0.6.6.7
pacemaker-mgmt-client-2.1.1-0.6.2.17
libpacemaker3-1.1.7-0.13.9

OS: SLES 11 SP2, kernel 3.0.80-0.7-default

Ask me if you need more information.

Thanks, bye,
Walter
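If it helps, these are the kinds of values I can collect from the node. (A sketch of standard Linux checks, since fork(2) can return "Cannot allocate memory" even with plenty of free physical RAM when overcommit accounting or process limits are hit; the /proc paths are standard kernel interfaces, not specific to my setup.)

```shell
# Why can fork() fail with ENOMEM despite free RAM? Two usual suspects:

# 1) Overcommit accounting: with strict accounting (mode 2), forking a
#    large process can fail once Committed_AS approaches CommitLimit.
grep -E 'MemTotal|MemFree|CommitLimit|Committed_AS' /proc/meminfo
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic, 1 = always, 2 = strict
cat /proc/sys/vm/overcommit_ratio    # only relevant in mode 2

# 2) Process/thread limits, which also surface as failed forks.
ulimit -u                            # max user processes for this shell
cat /proc/sys/kernel/threads-max     # system-wide thread limit
```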
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org