Maybe you can use sar to check whether your server was short on resources?

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
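"Cannot fork" usually means the kernel refused fork() (EAGAIN/ENOMEM), so memory pressure or a process limit around 04:10 would be the first thing to rule out. A sketch of how you might query sysstat's history, assuming Debian's default layout where daily data lands in /var/log/sysstat/saDD (the file name and time window below are assumptions, adjust to your system):

```shell
# Memory utilization around the failure window (sar -r = memory report).
sar -r -f /var/log/sysstat/sa25 -s 04:00:00 -e 04:20:00

# Run queue length and load averages for the same window (sar -q).
sar -q -f /var/log/sysstat/sa25 -s 04:00:00 -e 04:20:00
```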
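Also, regarding the "invalid parameter" below: rc=2 is OCF_ERR_ARGS in the OCF return-code mapping, but here it most likely came from the agent dying mid-run rather than from a real configuration error. If you want to exercise the agent outside Pacemaker, resource-agents ships ocf-tester (the resource name and path here match your config; this is only a suggestion, run it on lb02 as root):

```shell
# Run the OCF test suite against the nginx agent (package: resource-agents).
# -n sets the resource name; add -o key=value for any params your primitive uses.
ocf-tester -n Nginx-rsc /usr/lib/ocf/resource.d/heartbeat/nginx
```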
2015-01-26 18:22 GMT+01:00 Oscar Salvador <osalvador.vilard...@gmail.com>:
> Oh, I forgot some important details:
>
> root# (S) crm status
> ============
> Last updated: Mon Jan 26 18:21:35 2015
> Last change: Sun Jan 25 05:19:13 2015 via crm_resource on lb01
> Stack: Heartbeat
> Current DC: lb01 (43b2c5a1-9552-4438-962b-6e98a2dd67c7) - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, unknown expected votes
> 8 Resources configured.
> ============
>
> Online: [ lb01 lb02 ]
>
> IP-rsc_mysql (ocf::heartbeat:IPaddr2): Started lb02
> IP-rsc_nginx (ocf::heartbeat:IPaddr2): Started lb02
> IP-rsc_nginx6 (ocf::heartbeat:IPv6addr): Started lb02
> IP-rsc_mysql6 (ocf::heartbeat:IPv6addr): Started lb02
> IP-rsc_elasticsearch6 (ocf::heartbeat:IPv6addr): Started lb02
> IP-rsc_elasticsearch (ocf::heartbeat:IPaddr2): Started lb02
> Ldirector-rsc (ocf::heartbeat:ldirectord): Started lb02
> Nginx-rsc (ocf::heartbeat:nginx): Started lb02
>
> This is running on:
>
> Debian 7.8
> pacemaker 1.1.7-1
>
> 2015-01-26 18:20 GMT+01:00 Oscar Salvador <osalvador.vilard...@gmail.com>:
>>
>> Hi!
>>
>> I'm writing here because two days ago I experienced a strange problem in
>> my Pacemaker cluster.
>> Everything was working fine, till suddenly a segfault in the Nginx monitor
>> resource happened:
>>
>> Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7551 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
>> Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
>> Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
>> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>> Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
>> Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 999997 more times on lb02 before being forced off
>> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message: Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
>> Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph 7552 (ref=pe_calc-dc-1422155424-7644) derived from /var/lib/pengine/pe-input-90.bz2
>> Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7552 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
>> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Segmentation fault    ******* here it starts
>>
>> As you can see, the last line.
>> And then:
>>
>> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
>>
>> I guess here Nginx was killed.
>>
>> And then I have some other errors, till Pacemaker decides to move the
>> resources to the other node:
>>
>> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
>> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
>> Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0, magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=3.14.40) : Old event
>> Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++, time=1422155430)
>> Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile /var/log/ha-log
>> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Nginx-rsc (1)
>> Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid parameter' (rc=2)
>> Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
>> Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 999997 more times on lb02 before being forced off
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_mysql (lb02)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx (lb02)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx6 (lb02)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_elasticsearch (lb02)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Ldirector-rsc (Started lb02 -> lb01)
>> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Nginx-rsc (Started lb02 -> lb01)
>> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent update 23: fail-count-Nginx-rsc=1
>> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Nginx-rsc (1422155430)
>>
>> I see that Pacemaker is complaining about some errors like "invalid
>> parameter", for example in these lines:
>>
>> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
>>
>> Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid parameter' (rc=2)
>>
>> It sounds (to me) like a syntax problem defining the resources, but I've
>> checked the config with crm_verify and there are no errors:
>>
>> root# (S) crm_verify -LVV
>> root# (S)
>>
>> So I'm just wondering why Pacemaker is complaining about an invalid
>> parameter.
>>
>> These are my CIB objects:
>>
>> node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
>> node $id="68328520-68e0-42fd-9adf-062655691643" lb02
>> primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
>>     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
>> primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
>>     params ipv6addr="xxxxxxxxxxxxxxxx" \
>>     op monitor interval="10s"
>> primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
>>     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
>> primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
>>     params ipv6addr="xxxxxxxxxxxxxx" \
>>     op monitor interval="10s"
>> primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
>>     params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
>> primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
>>     params ipv6addr="xxxxxxxxxxxxxx" \
>>     op monitor interval="10s"
>> primitive Ldirector-rsc ocf:heartbeat:ldirectord \
>>     op monitor interval="10s" timeout="30s"
>> primitive Nginx-rsc ocf:heartbeat:nginx \
>>     op monitor interval="10s" timeout="30s"
>> location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
>>     rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
>> location cli-standby-IP-rsc_mysql IP-rsc_mysql \
>>     rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
>> location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
>>     rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
>> location cli-standby-IP-rsc_nginx IP-rsc_nginx \
>>     rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
>> location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
>>     rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
>> colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx IP-rsc_nginx6 IP-rsc_elasticsearch
>> order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc Nginx-rsc IP-rsc_elasticsearch
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>     cluster-infrastructure="Heartbeat" \
>>     stonith-enabled="false"
>>
>> Do you have some hints that I can follow?
>>
>> Thanks in advance!
>>
>> Oscar
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
esta es mi vida e me la vivo hasta que dios quiera