Hello,

I have a problem with a Pacemaker cluster (3 nodes, SAP production environment).

Node 1

Feb 11 12:00:39 s-xxx-05 lrmd: [12995]: info: operation monitor[85] on 
ip_wd_WIC_pri for client 12998: pid 27282 exited with return code 0
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: 
(ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 
fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: 
(ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 
fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: 
(ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 
fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: 
(ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 
fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 crmd: [12998]: info: process_lrm_event: LRM operation 
ipbck_wd_WIC_pri_monitor_10000 (call=87, rc=7, cib-update=105, confirmed=false) 
not running
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update 
relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-ipbck_wd_WIC_pri (1)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent 
update 28: fail-count-ipbck_wd_WIC_pri=1
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update 
relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-ipbck_wd_WIC_pri (1392116476)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent 
update 31: last-failure-ipbck_wd_WIC_pri=1392116476
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: 
Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on 
operation monitor[14] on usrsap_WBW_pri:2 for client 12998, its parameters: 
CRM_meta_record_pending=[false] CRM_meta_clone=[2] fstype=[ocfs2] 
device=[/dev/sapBWPvg/sapWBW] CRM_meta_clone_node_max=[1] 
CRM_meta_notify=[false] CRM_meta_clone_max=[3] CRM_meta_globally_unique=[false] 
crm_feature_set=[3.0.6] directory=[/usr/sap/WBW] CRM_meta_name=[monitor] 
CRM_meta_interval=[60000] CRM_meta_timeout=[60000]
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: 
Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on 
operation stop[95] on webdisp_WIC_pri for client 12998, its parameters: 
CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_record_pending=[false] 
CRM_meta_timeout=[300000] InstanceName=[WIC_W39_vsicpwd] 
START_PROFILE=[/sapmnt/WIC/profile/WIC_W39_vsicpwd]
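
What strikes me here is that it is lrmd itself that fails fork() with ENOMEM: first the IPaddr2 agent cannot fork its helper commands (so the monitor comes back as rc=7 "not running", even though the address may still have been configured), and one second later lrmd cannot fork the monitor and stop operations at all. Since sa1 seems to be running from cron on these machines (see the Node 3 excerpt), I could look at the sysstat data collected around 12:01 on s-xxx-05; just a sketch, assuming the default /var/log/sa location and that sysstat runs on that node too:

  # memory (and, on newer sysstat versions, commit) usage around the failure window on Feb 11
  sar -r -f /var/log/sa/sa11 -s 11:50:00 -e 12:10:00
  # run queue and process-list size for the same window, to spot a runaway process count
  sar -q -f /var/log/sa/sa11 -s 11:50:00 -e 12:10:00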

Node 2

Feb 11 12:00:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: 
Transition 3196: PEngine Input stored in: /var/lib/pengine/pe-input-476.bz2
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: process_graph_event: Detected 
action ipbck_wd_WIC_pri_monitor_10000 from a different transition: 2546 vs. 3196
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
process_graph_event:476 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=ipbck_wd_WIC_pri_last_failure_0, 
magic=0:7;321:2546:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.325) : 
Old event
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating 
failcount for ipbck_wd_WIC_pri on s-xxx-05 after failed monitor: rc=7 
(update=value++, time=1392116476)
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-s-xxx-05-fail-count-ipbck_wd_WIC_pri, 
name=fail-count-ipbck_wd_WIC_pri, value=1, magic=NA, cib=0.1910.326) : 
Transient attribute: update
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-s-xxx-05-last-failure-ipbck_wd_WIC_pri, 
name=last-failure-ipbck_wd_WIC_pri, value=1392116476, magic=NA, cib=0.1910.327) 
: Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of 
CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not 
fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ipbck_wd_WIC_pri can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Recover 
ipbck_wd_WIC_pri      (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart 
ascs_ICP_pri  (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart 
webdisp_WIC_pri       (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 
3197 (ref=pe_calc-dc-1392116477-4106) derived from 
/var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: te_rsc_command: Initiating action 
414: stop webdisp_WIC_pri_stop_0 on s-xxx-05
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: status_from_rc: Action 414 
(webdisp_WIC_pri_stop_0) on s-xxx-05 failed (target: 0 vs. rc: -2): Error
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating 
failcount for webdisp_WIC_pri on s-xxx-05 after failed stop: rc=-2 
(update=INFINITY, time=1392116477)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, 
id=webdisp_WIC_pri_last_failure_0, 
magic=4:-2;414:3197:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.328) : 
Event failed
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: run_graph: ==== Transition 3197 
(Complete=2, Pending=0, Fired=0, Skipped=11, Incomplete=0, 
Source=/var/lib/pengine/pe-input-477.bz2): Stopped
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-s-xxx-05-fail-count-webdisp_WIC_pri, name=fail-count-webdisp_WIC_pri, 
value=INFINITY, magic=NA, cib=0.1910.329) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-s-xxx-05-last-failure-webdisp_WIC_pri, 
name=last-failure-webdisp_WIC_pri, value=1392116477, magic=NA, cib=0.1910.330) 
: Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: 
Transition 3197: PEngine Input stored in: /var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of 
CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not 
fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op webdisp_WIC_pri_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: pe_fence_node: Node s-xxx-05 
will be fenced to recover from resource failure(s)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing 
failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: 
ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
.
.
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    
ipbck_wd_WIC_pri      (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    
ascs_ICP_pri  (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    
webdisp_WIC_pri       (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 
3198 (ref=pe_calc-dc-1392116477-4108) derived from 
/var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: te_fence_node: Executing reboot 
fencing operation (464) on s-xxx-05 (timeout=12000)
Feb 11 12:01:17 s-xxx-06 stonith-ng: [10335]: info: initiate_remote_stonith_op: 
Initiating remote operation reboot for s-xxx-05: 
fff269bd-70f1-490b-a46f-92f2eaaa04f1
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: WARN: process_pe_message: Transition 
3198: WARNINGs found during PE processing. PEngine Input stored in: 
/var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: notice: process_pe_message: 
Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" 
to identify issues.
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: 
Refreshing port list for stonith-sbd_pri
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: 
stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: call_remote_stonith: 
Requesting that s-xxx-06 perform op reboot s-xxx-05
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: 
stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_fence: Found 1 
matching devices for 's-xxx-05'
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_command: Processed 
st_fence from s-xxx-06: rc=-1
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Delivery process handling 
/dev/mapper/SBD_LUN_QUORUM
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Writing reset to node slot s-xxx-05
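
If I read this correctly, the DC turns the failed stop (rc=-2, because lrmd could not even fork the stop operation) into fail-count=INFINITY and then fences s-xxx-05 through SBD, which I believe is the expected behaviour when a stop fails and STONITH is enabled. To double-check the fencing side, the SBD device named in the log can be inspected directly; a small sketch, assuming the standard sbd tooling shipped with SLES 11 SP2:

  # show the watchdog/msgwait timeouts stored in the SBD header
  sbd -d /dev/mapper/SBD_LUN_QUORUM dump
  # show the per-node slots and any pending messages (e.g. the "reset" for s-xxx-05)
  sbd -d /dev/mapper/SBD_LUN_QUORUM list
  # compare with the cluster-wide stonith-timeout
  crm configure show | grep -i stonith-timeout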


Node 3

Feb 11 12:00:01 s-xxx-04 /usr/sbin/cron[22525]: (root) CMD ([ -x 
/usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 -S ALL 1 1)
Feb 11 12:00:01 s-xxx-04 syslog-ng[4795]: Log statistics; 
dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', 
processed='center(queued)=11361', processed='center(received)=6355', 
processed='destination(messages)=1462', processed='destination(mailinfo)=4893', 
processed='destination(mailwarn)=0', processed='destination(localmessages)=0', 
processed='destination(newserr)=0', processed='destination(mailerr)=0', 
processed='destination(netmgm)=0', processed='destination(warn)=103', 
processed='destination(console)=5', processed='destination(null)=0', 
processed='destination(mail)=4893', processed='destination(xconsole)=5', 
processed='destination(firewall)=0', processed='destination(acpid)=0', 
processed='destination(newscrit)=0', processed='destination(newsnotice)=0', 
processed='source(src)=6355'
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node s-xxx-06 
now has id: 101344266
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node 
101344266 is now known as s-xxx-06
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: stonith_command: Processed 
st_query from s-xxx-06: rc=0
Feb 11 12:01:23 s-xxx-04 corosync[12944]:  [TOTEM ] A processor failed, forming 
new configuration.
Feb 11 12:01:29 s-xxx-04 corosync[12944]:  [CLM   ] CLM CONFIGURATION CHANGE



Could this "Cannot allocate memory" error indicate that memory could not be allocated for a new Resource Agent instance?
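
My understanding is that the message is the ENOMEM error from fork(2): the kernel refused to create the new process for the resource agent (or for the commands the agent itself runs). As far as I know this can happen even with plenty of physical RAM if the commit limit, a ulimit, or the process-count limits are hit, so a few quick checks on s-xxx-05, as a sketch:

  # overcommit policy: 2 = strict accounting, fork/malloc fail once CommitLimit is reached
  sysctl vm.overcommit_memory vm.overcommit_ratio
  # how close the system is to the commit limit
  grep -E 'MemFree|CommitLimit|Committed_AS' /proc/meminfo
  # process/thread limits can also surface as ENOMEM on fork
  sysctl kernel.pid_max kernel.threads-max
  ulimit -u      # max user processes for the user running the check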

I have 128 GB of RAM.

THP (transparent huge pages) is set to never.
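
To be sure about that setting, and to catch the state of the machine the next time the error shows up, I am thinking of something along these lines (only a sketch; the snapshot file name is arbitrary):

  # the active THP value is the one in brackets; it should show [never]
  cat /sys/kernel/mm/transparent_hugepage/enabled
  # dump a small memory/process snapshot whenever lrmd logs the fork failure again
  tail -F -n 0 /var/log/messages | while read -r line; do
      case $line in
          *"fork: Cannot allocate memory"*)
              { date; grep -E 'MemFree|CommitLimit|Committed_AS' /proc/meminfo; \
                echo "processes: $(ps ax | wc -l)"; } >> /var/log/fork-enomem-snapshot.log
              ;;
      esac
  done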


Versions:

openais-1.1.4-5.8.7.1
libopenais3-1.1.4-5.8.7.1
pacemaker-mgmt-2.1.1-0.6.2.17
pacemaker-1.1.7-0.13.9
drbd-pacemaker-8.4.2-0.6.6.7
pacemaker-mgmt-client-2.1.1-0.6.2.17
libpacemaker3-1.1.7-0.13.9

OS: SLES 11 SP2, kernel 3.0.80-0.7-default

Please ask me if you need more information.


Thanks

Bye
Walter

