On 03/16/2016 11:20 AM, Lorand Kelemen wrote:
> Dear Ken,
>
> I already modified the startup as suggested during testing, thanks! I
> swapped the postfix ocf resource for the amavisd systemd resource, as the
> latter, as it turns out, also controls postfix startup, and having both
> resources in the mail-services group causes conflicts (postfix is detected
> as not running).
>
> Still experiencing the same behaviour: killing amavisd returns rc=7 for
> the monitoring operation on the "victim" node, which sounds logical, but
> the logs contain the same message: amavisd and virtualip cannot run
> anywhere.
>
> I made sure systemd is clean (amavisd = inactive, not running instead of
> failed) and also reset the failcount on all resources before killing
> amavisd.
>
> How can I make sure to have a clean state for the resources besides the
> above actions?
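For reference, the reset Lorand describes is typically done with commands along these lines (a sketch based on the pcs tooling shipped with CentOS 7; the resource and node names are the ones used in this thread, and subcommand syntax can vary between pcs versions):

    # One-shot status including inactive resources (-r) and fail counts (-f)
    crm_mon -1rf

    # Clear a resource's failure history so the next test starts clean
    pcs resource cleanup amavisd

    # Or inspect and reset fail counts explicitly, per resource and node
    pcs resource failcount show amavisd
    pcs resource failcount reset amavisd mail1

    # And confirm systemd agrees the unit is cleanly stopped
    systemctl is-active amavisd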
What you did is fine. I'm not sure why amavisd and virtualip can't run. Can
you show the output of "cibadmin -Q" when the cluster is in that state?

> Also note: when causing a filesystem resource to fail (e.g. with unmount),
> the failover happens successfully and all resources are started on the
> "survivor" node.
>
> Best regards,
> Lorand
>
> On Wed, Mar 16, 2016 at 4:34 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>
>> On 03/16/2016 05:49 AM, Lorand Kelemen wrote:
>>> Dear Ken,
>>>
>>> Thanks for the reply! I lowered migration-threshold to 1 and rearranged
>>> constraints like you suggested:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   promote mail-clone then start fs-services (kind:Mandatory)
>>>   promote spool-clone then start fs-services (kind:Mandatory)
>>>   start fs-services then start network-services (kind:Mandatory)
>>
>> Certainly not a big deal, but I would change the above constraint to
>> start fs-services then start mail-services. The IP doesn't care whether
>> the filesystems are up yet or not, but postfix does.
>>
>>>   start network-services then start mail-services (kind:Mandatory)
>>> Colocation Constraints:
>>>   fs-services with spool-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
>>>   fs-services with mail-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
>>>   network-services with mail-services (score:INFINITY)
>>>   mail-services with fs-services (score:INFINITY)
>>>
>>> Now virtualip and postfix become stopped. I guess these are the
>>> relevant lines, but I also attach the full logs:
>>>
>>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: native_color: Resource postfix cannot run anywhere
>>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: native_color: Resource virtualip-1 cannot run anywhere
>>>
>>> Interesting, will try to play around with ordering - colocation; the
>>> solution must be in these settings...
>>>
>>> Best regards,
>>> Lorand
>>>
>>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: --- 0.215.7 2 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: +++ 0.215.8 (null) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + /cib: @num_updates=8 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: ++ >>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']: >>> <lrm_rsc_op id="postfix_last_failure_0" >>> operation_key="postfix_monitor_45000" operation="monitor" >>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" >>> transition-key="86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a" >>> transition-magic="0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a" >>> on_node="mail1" call-id="1333" rc-code="7" >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> abort_transition_graph: Transition aborted by postfix_monitor_45000 >>> 'create' on mail1: Inactive graph >>> (magic=0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.215.8, >>> source=process_graph_event:598, 1) >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> update_failcount: Updating failcount for postfix on mail1 after >> failed >>> monitor: rc=7 (update=value++, time=1458124686)
>>
>> I don't think your constraints are causing problems now; the above
>> message indicates that the postfix resource failed. Postfix may not be
>> able to run anywhere because it's already failed on both nodes, and the
>> IP would be down because it has to be colocated with postfix, and
>> postfix can't run.
>>
>> The rc=7 above indicates that the postfix agent's monitor operation
>> returned 7, which is "not running". I'd check the logs for postfix
>> errors.
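Two checks that make this diagnosis concrete: capturing the CIB Ken asks for above, and invoking the resource agent's monitor action by hand to see the OCF return code directly (a sketch; /usr/lib/ocf is the stock OCF_ROOT on CentOS 7, and the agent path comes from the resource-agents package):

    # Dump the complete live CIB while the cluster is in the failed state
    cibadmin -Q > /tmp/cib-failed-state.xml

    # Run the postfix agent's monitor by hand: exit 0 = running, 7 = not running
    export OCF_ROOT=/usr/lib/ocf
    /usr/lib/ocf/resource.d/heartbeat/postfix monitor; echo $?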
>>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> process_graph_event: Detected action (2962.86) >>> postfix_monitor_45000.1333=not running: failed >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_client_update: Expanded fail-count-postfix=value++ to 1 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Completed cib_modify operation for section status: >> OK >>> (rc=0, origin=mail1/crmd/253, version=0.215.8) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_peer_update: Setting fail-count-postfix[mail1]: (null) -> 1 from >>> mail2 >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: notice: >>> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ >>> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> write_attribute: Sent update 406 with 2 changes for >>> fail-count-postfix, id=<n/a>, set=(null) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_peer_update: Setting last-failure-postfix[mail1]: 1458124291 -> >>> 1458124686 from mail2 >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> write_attribute: Sent update 407 with 2 changes for >>> last-failure-postfix, id=<n/a>, set=(null) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Forwarding cib_modify operation for section status >> to >>> master (origin=local/attrd/406) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Forwarding cib_modify operation for section status >> to >>> master (origin=local/attrd/407) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: --- 0.215.8 2 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: +++ 0.215.9 (null) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + /cib: @num_updates=9 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: ++ >>> >> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: >>> <nvpair id="status-1-fail-count-postfix" name="fail-count-postfix" >>> value="1"/> >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Completed cib_modify operation for section status: >> OK >>> (rc=0, origin=mail2/attrd/406, version=0.215.9) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: --- 0.215.9 2 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: +++ 0.215.10 (null) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + /cib: @num_updates=10 >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + >>> >> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']: >>> @value=1458124686 >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_cib_callback: Update 406 for fail-count-postfix: OK (0) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>>
attrd_cib_callback: Update 406 for fail-count-postfix[mail1]=1: OK (0) >>> Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Completed cib_modify operation for section status: >> OK >>> (rc=0, origin=mail2/attrd/407, version=0.215.10) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_cib_callback: Update 406 for fail-count-postfix[mail2]=(null): OK >>> (0) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_cib_callback: Update 407 for last-failure-postfix: OK (0) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_cib_callback: Update 407 for >>> last-failure-postfix[mail1]=1458124686: OK (0) >>> Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info: >>> attrd_cib_callback: Update 407 for >>> last-failure-postfix[mail2]=1457610376: OK (0) >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> abort_transition_graph: Transition aborted by >>> status-1-fail-count-postfix, fail-count-postfix=1: Transient attribute >>> change (create cib=0.215.9, source=abort_unless_down:319, >>> >> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], >>> 1) >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> abort_transition_graph: Transition aborted by >>> status-1-last-failure-postfix, last-failure-postfix=1458124686: Transient >>> attribute change (modify cib=0.215.10, source=abort_unless_down:319, >>> >> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'], >>> 1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> unpack_config: On loss of CCM Quorum: Ignore >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_online_status: Node mail1 is online >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_online_status: Node mail2 is online >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource mail:0 active in >>> master mode on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource spool:0 active in >>> master mode on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-spool active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-spool active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-mail active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-mail active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning: >>> unpack_rsc_op_failure: Processing failed op monitor for postfix on >>> mail1: not running (7) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource spool:1 active in >>> master mode on mail2 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource mail:1 active in >>> master mode on mail2 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: network-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: 
info: >>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2): >> Started >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> clone_print: Master/Slave Set: spool-clone [spool] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Masters: [ mail1 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Slaves: [ mail2 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> clone_print: Master/Slave Set: mail-clone [mail] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Masters: [ mail1 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Slaves: [ mail2 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: fs-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: mail-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: postfix (ocf::heartbeat:postfix): FAILED >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: Promoting mail:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: mail-clone: Promoted 1 instances of a possible 1 to master >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: Promoting spool:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: spool-clone: Promoted 1 instances of a possible 1 to master >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> RecurringOp: Start recurring monitor (45s) for postfix on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave virtualip-1 (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave spool:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave spool:1 (Slave mail2) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave mail:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave mail:1 (Slave mail2) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave fs-spool (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave fs-mail (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> LogActions: Recover postfix (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> process_pe_message: Calculated Transition 2963: >>> /var/lib/pacemaker/pengine/pe-input-330.bz2 >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> handle_response: pe_calc calculation pe_calc-dc-1458124686-5541 is >>> obsolete >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> unpack_config: On loss of CCM Quorum: Ignore >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_online_status: Node mail1 is online >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_online_status: Node mail2 is online >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: 
>>> determine_op_status: Operation monitor found resource mail:0 active in >>> master mode on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource spool:0 active in >>> master mode on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-spool active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-spool active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-mail active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource fs-mail active on >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning: >>> unpack_rsc_op_failure: Processing failed op monitor for postfix on >>> mail1: not running (7) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource spool:1 active in >>> master mode on mail2 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> determine_op_status: Operation monitor found resource mail:1 active in >>> master mode on mail2 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: network-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2): >> Started >>> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> clone_print: Master/Slave Set: spool-clone [spool] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Masters: [ mail1 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Slaves: [ mail2 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> clone_print: Master/Slave Set: mail-clone [mail] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Masters: [ mail1 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> short_print: Slaves: [ mail2 ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: fs-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> group_print: Resource Group: mail-services >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_print: postfix (ocf::heartbeat:postfix): FAILED >> mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> get_failcount_full: postfix has failed 1 times on mail1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning: >>> common_apply_stickiness: Forcing postfix away from mail1 after 1 >>> failures (max=1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: Promoting mail:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: mail-clone: Promoted 1 instances of a possible 1 to master >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> master_color: Promoting spool:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local 
pengine: info: >>> master_color: spool-clone: Promoted 1 instances of a possible 1 to master >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> rsc_merge_weights: fs-mail: Rolling back scores from postfix >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> rsc_merge_weights: postfix: Rolling back scores from virtualip-1 >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_color: Resource postfix cannot run anywhere >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> native_color: Resource virtualip-1 cannot run anywhere >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> LogActions: Stop virtualip-1 (mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave spool:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave spool:1 (Slave mail2) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave mail:0 (Master mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave mail:1 (Slave mail2) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave fs-spool (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info: >>> LogActions: Leave fs-mail (Started mail1) >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> LogActions: Stop postfix (mail1) >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> do_state_transition: State transition S_POLICY_ENGINE -> >>> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE >>> origin=handle_response ] >>> Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice: >>> process_pe_message: Calculated Transition 2964: >>> /var/lib/pacemaker/pengine/pe-input-331.bz2 >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info: >>> do_te_invoke: Processing graph 2964 (ref=pe_calc-dc-1458124686-5542) >>> derived from /var/lib/pacemaker/pengine/pe-input-331.bz2 >>> Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: notice: >>> te_rsc_command: Initiating action 5: stop postfix_stop_0 on mail1 >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: --- 0.215.10 2 >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: +++ 0.215.11 (null) >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + /cib: @num_updates=11 >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + >>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']: >>> @operation_key=postfix_stop_0, @operation=stop, >>> @transition-key=5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>> @transition-magic=0:0;5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>> @call-id=1335, @last-run=1458124686, @last-rc-change=1458124686, >>> @exec-time=435 >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info: >>> match_graph_event: Action postfix_stop_0 (5) confirmed on mail1 (rc=0) >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Completed cib_modify operation for section status: >> OK >>> (rc=0, origin=mail1/crmd/254, version=0.215.11) >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice: >>> te_rsc_command: Initiating action 12: stop virtualip-1_stop_0 on >> mail1 >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: --- 0.215.11 2 >>> Mar 16 11:38:07 [7415] 
HWJ-626.domain.local cib: info: >>> cib_perform_op: Diff: +++ 0.215.12 (null) >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + /cib: @num_updates=12 >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_perform_op: + >>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']: >>> @operation_key=virtualip-1_stop_0, @operation=stop, >>> @transition-key=12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>> @transition-magic=0:0;12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>> @call-id=1337, @last-run=1458124687, @last-rc-change=1458124687, >>> @exec-time=56 >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info: >>> match_graph_event: Action virtualip-1_stop_0 (12) confirmed on mail1 >>> (rc=0) >>> Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info: >>> cib_process_request: Completed cib_modify operation for section status: >> OK >>> (rc=0, origin=mail1/crmd/255, version=0.215.12) >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice: >>> run_graph: Transition 2964 (Complete=7, Pending=0, Fired=0, Skipped=0, >>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-331.bz2): >> Complete >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info: do_log: >>> FSA: Input I_TE_SUCCESS from notify_crmd() received in state >>> S_TRANSITION_ENGINE >>> Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice: >>> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ >>> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>> Mar 16 11:38:12 [7415] HWJ-626.domain.local cib: info: >>> cib_process_ping: Reporting our current digest to mail2: >>> ed43bc3ecf0f15853900ba49fc514870 for 0.215.12 (0x152b110 0)
>>>
>>> On Mon, Mar 14, 2016 at 6:44 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>
>>>> On 03/10/2016 09:49 AM, Lorand Kelemen wrote:
>>>>> Dear List,
>>>>>
>>>>> After the creation and testing of a simple 2-node active-passive
>>>>> drbd+postfix cluster, nearly everything works flawlessly (standby,
>>>>> failure of a filesystem resource + failover, split-brain + manual
>>>>> recovery); however, when deliberately killing the postfix instance,
>>>>> failover does not occur after reaching the migration threshold, and the
>>>>> resources revert to the Stopped state (except the master-slave drbd
>>>>> resource, which works as expected).
>>>>>
>>>>> Ordering and colocation are configured, STONITH and quorum disabled;
>>>>> the goal is to always have one node running all the resources, and at
>>>>> any sign of error it should fail over to the passive node, nothing
>>>>> fancy.
>>>>>
>>>>> Is my configuration wrong or am I hitting a bug?
>>>>>
>>>>> All software is from the CentOS 7 + elrepo repositories.
>>>>
>>>> With these versions, you can set "two_node: 1" in
>>>> /etc/corosync/corosync.conf (which will be done automatically if you
>>>> used "pcs cluster setup" initially), and then you don't need to ignore
>>>> quorum in pacemaker.
>>>>
>>>>> Regarding STONITH: the machines are running on free ESXi instances on
>>>>> separate hosts, so the VMware fencing agents won't work because in the
>>>>> free version the API is read-only.
>>>>> Still trying to figure out a way to go; until then, manual recovery +
>>>>> huge ARP cache times on the upstream firewall...
>>>>>
>>>>> Please find the pe-input*.bz2 files attached, logs and config below.
>>>>> The situation: on node mail1 postfix was killed 3 times (migration
>>>>> threshold); it should have failed over to mail2.
>>>>> When killing a filesystem resource three times, this happens
>>>>> flawlessly.
>>>>>
>>>>> Thanks for your input!
>>>>>
>>>>> Best regards,
>>>>> Lorand
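For reference, Ken's two_node suggestion above corresponds to a quorum stanza along these lines in /etc/corosync/corosync.conf (a sketch; note that two_node: 1 also implicitly enables wait_for_all, so both nodes must be seen once before the cluster first becomes quorate):

    quorum {
        provider: corosync_votequorum
        # Two-node mode: the cluster remains quorate with a single node up
        two_node: 1
    }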
>>>>> Cluster Name: mailcluster
>>>>> Corosync Nodes:
>>>>>  mail1 mail2
>>>>> Pacemaker Nodes:
>>>>>  mail1 mail2
>>>>>
>>>>> Resources:
>>>>>  Group: network-services
>>>>>   Resource: virtualip-1 (class=ocf provider=heartbeat type=IPaddr2)
>>>>>    Attributes: ip=10.20.64.10 cidr_netmask=24 nic=lan0
>>>>>    Operations: start interval=0s timeout=20s (virtualip-1-start-interval-0s)
>>>>>                stop interval=0s timeout=20s (virtualip-1-stop-interval-0s)
>>>>>                monitor interval=30s (virtualip-1-monitor-interval-30s)
>>>>>  Master: spool-clone
>>>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>>>>   Resource: spool (class=ocf provider=linbit type=drbd)
>>>>>    Attributes: drbd_resource=spool
>>>>>    Operations: start interval=0s timeout=240 (spool-start-interval-0s)
>>>>>                promote interval=0s timeout=90 (spool-promote-interval-0s)
>>>>>                demote interval=0s timeout=90 (spool-demote-interval-0s)
>>>>>                stop interval=0s timeout=100 (spool-stop-interval-0s)
>>>>>                monitor interval=10s (spool-monitor-interval-10s)
>>>>>  Master: mail-clone
>>>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>>>>   Resource: mail (class=ocf provider=linbit type=drbd)
>>>>>    Attributes: drbd_resource=mail
>>>>>    Operations: start interval=0s timeout=240 (mail-start-interval-0s)
>>>>>                promote interval=0s timeout=90 (mail-promote-interval-0s)
>>>>>                demote interval=0s timeout=90 (mail-demote-interval-0s)
>>>>>                stop interval=0s timeout=100 (mail-stop-interval-0s)
>>>>>                monitor interval=10s (mail-monitor-interval-10s)
>>>>>  Group: fs-services
>>>>>   Resource: fs-spool (class=ocf provider=heartbeat type=Filesystem)
>>>>>    Attributes: device=/dev/drbd0 directory=/var/spool/postfix fstype=ext4 options=nodev,nosuid,noexec
>>>>>    Operations: start interval=0s timeout=60 (fs-spool-start-interval-0s)
>>>>>                stop interval=0s timeout=60 (fs-spool-stop-interval-0s)
>>>>>                monitor interval=20 timeout=40 (fs-spool-monitor-interval-20)
>>>>>   Resource: fs-mail (class=ocf provider=heartbeat type=Filesystem)
>>>>>    Attributes: device=/dev/drbd1 directory=/var/spool/mail fstype=ext4 options=nodev,nosuid,noexec
>>>>>    Operations: start interval=0s timeout=60 (fs-mail-start-interval-0s)
>>>>>                stop interval=0s timeout=60 (fs-mail-stop-interval-0s)
>>>>>                monitor interval=20 timeout=40 (fs-mail-monitor-interval-20)
>>>>>  Group: mail-services
>>>>>   Resource: postfix (class=ocf provider=heartbeat type=postfix)
>>>>>    Operations: start interval=0s timeout=20s (postfix-start-interval-0s)
>>>>>                stop interval=0s timeout=20s (postfix-stop-interval-0s)
>>>>>                monitor interval=45s (postfix-monitor-interval-45s)
>>>>>
>>>>> Stonith Devices:
>>>>> Fencing Levels:
>>>>>
>>>>> Location Constraints:
>>>>> Ordering Constraints:
>>>>>  start network-services then promote mail-clone (kind:Mandatory) (id:order-network-services-mail-clone-mandatory)
>>>>>  promote mail-clone then promote spool-clone (kind:Mandatory) (id:order-mail-clone-spool-clone-mandatory)
>>>>>  promote spool-clone then start fs-services (kind:Mandatory) (id:order-spool-clone-fs-services-mandatory)
>>>>>  start fs-services then start mail-services (kind:Mandatory) (id:order-fs-services-mail-services-mandatory)
>>>>> Colocation Constraints:
>>>>>  network-services with spool-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-network-services-spool-clone-INFINITY)
>>>>>  network-services with mail-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-network-services-mail-clone-INFINITY)
>>>>>  network-services with fs-services (score:INFINITY) (id:colocation-network-services-fs-services-INFINITY)
>>>>>  network-services with mail-services (score:INFINITY) (id:colocation-network-services-mail-services-INFINITY)
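In the reply just below, Ken suggests making each layer depend only on the layer beneath it. As a rough sketch of what that reworked chain might look like with pcs (syntax as in the pcs 0.9 series on CentOS 7; constraint ids and scores would be regenerated):

    # The filesystems sit on the DRBD masters...
    pcs constraint colocation add fs-services with master spool-clone INFINITY
    pcs constraint colocation add fs-services with master mail-clone INFINITY
    pcs constraint order promote spool-clone then start fs-services
    pcs constraint order promote mail-clone then start fs-services
    # ...postfix sits on the filesystems...
    pcs constraint colocation add mail-services with fs-services INFINITY
    pcs constraint order start fs-services then start mail-services
    # ...and the virtual IP only needs to be tied to, and ordered before, postfix
    pcs constraint colocation add network-services with mail-services INFINITY
    pcs constraint order start network-services then start mail-services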
>>>>
>>>> I'm not sure whether it's causing your issue, but I would make the
>>>> constraints reflect the logical relationships better.
>>>>
>>>> For example, network-services only needs to be colocated with
>>>> mail-services logically; it's mail-services that needs to be with
>>>> fs-services, and fs-services that needs to be with
>>>> spool-clone/mail-clone master. In other words, don't make the
>>>> highest-level resource depend on everything else; make each level
>>>> depend on the level below it.
>>>>
>>>> Also, I would guess that the virtual IP only needs to be ordered before
>>>> mail-services, and mail-clone and spool-clone could both be ordered
>>>> before fs-services, rather than ordering mail-clone before spool-clone.
>>>>
>>>>> Resources Defaults:
>>>>>  migration-threshold: 3
>>>>> Operations Defaults:
>>>>>  on-fail: restart
>>>>>
>>>>> Cluster Properties:
>>>>>  cluster-infrastructure: corosync
>>>>>  cluster-name: mailcluster
>>>>>  cluster-recheck-interval: 5min
>>>>>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>>>>  default-resource-stickiness: infinity
>>>>>  have-watchdog: false
>>>>>  last-lrm-refresh: 1457613674
>>>>>  no-quorum-policy: ignore
>>>>>  pe-error-series-max: 1024
>>>>>  pe-input-series-max: 1024
>>>>>  pe-warn-series-max: 1024
>>>>>  stonith-enabled: false
>>>>>
>>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.15 2 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.16 (null) >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=16 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_failure_0']: >>>>> @transition-key=4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @transition-magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @call-id=1274, @last-rc-change=1457613440 >>>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: >>>>> abort_transition_graph: Transition aborted by >> postfix_monitor_45000 >>>>> 'modify' on mail1: Inactive graph >>>>> (magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.197.16, >>>>> source=process_graph_event:598, 1) >>>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: >>>>> update_failcount: Updating failcount for postfix on mail1 after >>>> failed >>>>> monitor: rc=7 (update=value++, time=1457613440) >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_client_update: Expanded fail-count-postfix=value++ to 3 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail1/crmd/196, version=0.197.16) >>>>> Mar 10
13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_peer_update: Setting fail-count-postfix[mail1]: 2 -> 3 from >>>> mail2 >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> write_attribute: Sent update 400 with 2 changes for >>>>> fail-count-postfix, id=<n/a>, set=(null) >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Forwarding cib_modify operation for section >> status >>>> to >>>>> master (origin=local/attrd/400) >>>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: >>>>> process_graph_event: Detected action (1234.4) >>>>> postfix_monitor_45000.1274=not running: failed >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_peer_update: Setting last-failure-postfix[mail1]: 1457613347 >> -> >>>>> 1457613440 from mail2 >>>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: notice: >>>>> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ >>>>> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> write_attribute: Sent update 401 with 2 changes for >>>>> last-failure-postfix, id=<n/a>, set=(null) >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.16 2 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.17 (null) >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=17 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix']: >>>>> @value=3 >>>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail2/attrd/400, version=0.197.17) >>>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: >>>>> abort_transition_graph: Transition aborted by >>>>> status-1-fail-count-postfix, fail-count-postfix=3: Transient attribute >>>>> change (modify cib=0.197.17, source=abort_unless_down:319, >>>>> >>>> >> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix'], >>>>> 1) >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 400 for fail-count-postfix: OK (0) >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 400 for fail-count-postfix[mail1]=3: OK >> (0) >>>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 400 for fail-count-postfix[mail2]=(null): >> OK >>>>> (0) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Forwarding cib_modify operation for section >> status >>>> to >>>>> master (origin=local/attrd/401) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.17 2 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.18 (null) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=18 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> 
/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']: >>>>> @value=1457613440 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail2/attrd/401, version=0.197.18) >>>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 401 for last-failure-postfix: OK (0) >>>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 401 for >>>>> last-failure-postfix[mail1]=1457613440: OK (0) >>>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: >>>>> attrd_cib_callback: Update 401 for >>>>> last-failure-postfix[mail2]=1457610376: OK (0) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> abort_transition_graph: Transition aborted by >>>>> status-1-last-failure-postfix, last-failure-postfix=1457613440: >> Transient >>>>> attribute change (modify cib=0.197.18, source=abort_unless_down:319, >>>>> >>>> >> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'], >>>>> 1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> unpack_config: On loss of CCM Quorum: Ignore >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_online_status: Node mail1 is online >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_online_status: Node mail2 is online >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource mail:0 active in >>>>> master mode on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource spool:0 active >> in >>>>> master mode on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource fs-spool active >> on >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource fs-mail active >> on >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: >>>>> unpack_rsc_op_failure: Processing failed op monitor for postfix >> on >>>>> mail1: not running (7) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource spool:1 active >> in >>>>> master mode on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource mail:1 active in >>>>> master mode on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: network-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2): >>>> Started >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> clone_print: Master/Slave Set: spool-clone [spool] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Masters: [ mail1 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Slaves: [ mail2 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> clone_print: Master/Slave Set: mail-clone [mail] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local 
pengine: info: >>>>> short_print: Masters: [ mail1 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Slaves: [ mail2 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: fs-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: mail-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: postfix (ocf::heartbeat:postfix): FAILED >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> get_failcount_full: postfix has failed 3 times on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: >>>>> common_apply_stickiness: Forcing postfix away from mail1 after 3 >>>>> failures (max=3) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: Promoting mail:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: mail-clone: Promoted 1 instances of a possible 1 to >> master >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: Promoting spool:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: spool-clone: Promoted 1 instances of a possible 1 to >> master >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> rsc_merge_weights: postfix: Rolling back scores from virtualip-1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_color: Resource virtualip-1 cannot run anywhere >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> RecurringOp: Start recurring monitor (45s) for postfix on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop virtualip-1 (mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave spool:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave spool:1 (Slave mail2) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave mail:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave mail:1 (Slave mail2) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop fs-spool (Started mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop fs-mail (Started mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop postfix (Started mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> process_pe_message: Calculated Transition 1235: >>>>> /var/lib/pacemaker/pengine/pe-input-302.bz2 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> handle_response: pe_calc calculation pe_calc-dc-1457613441-3756 is >>>>> obsolete >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> unpack_config: On loss of CCM Quorum: Ignore >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_online_status: Node mail1 is online >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> 
determine_online_status: Node mail2 is online >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource mail:0 active in >>>>> master mode on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource spool:0 active >> in >>>>> master mode on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource fs-spool active >> on >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource fs-mail active >> on >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: >>>>> unpack_rsc_op_failure: Processing failed op monitor for postfix >> on >>>>> mail1: not running (7) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource spool:1 active >> in >>>>> master mode on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> determine_op_status: Operation monitor found resource mail:1 active in >>>>> master mode on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: network-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2): >>>> Started >>>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> clone_print: Master/Slave Set: spool-clone [spool] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Masters: [ mail1 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Slaves: [ mail2 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> clone_print: Master/Slave Set: mail-clone [mail] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Masters: [ mail1 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> short_print: Slaves: [ mail2 ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: fs-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> group_print: Resource Group: mail-services >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_print: postfix (ocf::heartbeat:postfix): FAILED >>>> mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> get_failcount_full: postfix has failed 3 times on mail1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: >>>>> common_apply_stickiness: Forcing postfix away from mail1 after 3 >>>>> failures (max=3) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: Promoting mail:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: mail-clone: Promoted 1 instances of a possible 1 to >> master >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: Promoting spool:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> master_color: spool-clone: 
Promoted 1 instances of a possible 1 to >> master >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> rsc_merge_weights: postfix: Rolling back scores from virtualip-1 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> native_color: Resource virtualip-1 cannot run anywhere >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> RecurringOp: Start recurring monitor (45s) for postfix on mail2 >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop virtualip-1 (mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave spool:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave spool:1 (Slave mail2) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave mail:0 (Master mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: >>>>> LogActions: Leave mail:1 (Slave mail2) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop fs-spool (Started mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop fs-mail (Started mail1) >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> LogActions: Stop postfix (Started mail1) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> do_state_transition: State transition S_POLICY_ENGINE -> >>>>> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE >>>>> origin=handle_response ] >>>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: >>>>> process_pe_message: Calculated Transition 1236: >>>>> /var/lib/pacemaker/pengine/pe-input-303.bz2 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> do_te_invoke: Processing graph 1236 (ref=pe_calc-dc-1457613441-3757) >>>>> derived from /var/lib/pacemaker/pengine/pe-input-303.bz2 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> te_rsc_command: Initiating action 12: stop virtualip-1_stop_0 on >>>> mail1 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> te_rsc_command: Initiating action 5: stop postfix_stop_0 on mail1 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.18 2 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.19 (null) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=19 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']: >>>>> @operation_key=virtualip-1_stop_0, @operation=stop, >>>>> @transition-key=12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @transition-magic=0:0;12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @call-id=1276, @last-run=1457613441, @last-rc-change=1457613441, >>>>> @exec-time=66 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail1/crmd/197, version=0.197.19) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> match_graph_event: Action virtualip-1_stop_0 (12) confirmed on mail1 >>>>> (rc=0) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.19 2 >>>>> Mar 10 13:37:21 [7415] 
HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.20 (null) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=20 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']: >>>>> @operation_key=postfix_stop_0, @operation=stop, >>>>> @transition-key=5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @transition-magic=0:0;5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @call-id=1278, @last-run=1457613441, @last-rc-change=1457613441, >>>>> @exec-time=476 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> match_graph_event: Action postfix_stop_0 (5) confirmed on mail1 >> (rc=0) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> te_rsc_command: Initiating action 79: stop fs-mail_stop_0 on >> mail1 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail1/crmd/198, version=0.197.20) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.20 2 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.21 (null) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=21 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-mail']/lrm_rsc_op[@id='fs-mail_last_0']: >>>>> @operation_key=fs-mail_stop_0, @operation=stop, >>>>> @transition-key=79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @transition-magic=0:0;79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @call-id=1280, @last-run=1457613441, @last-rc-change=1457613441, >>>>> @exec-time=88, @queue-time=1 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail1/crmd/199, version=0.197.21) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> match_graph_event: Action fs-mail_stop_0 (79) confirmed on mail1 >>>> (rc=0) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> te_rsc_command: Initiating action 77: stop fs-spool_stop_0 on >> mail1 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: --- 0.197.21 2 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: Diff: +++ 0.197.22 (null) >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + /cib: @num_updates=22 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_perform_op: + >>>>> >>>> >> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-spool']/lrm_rsc_op[@id='fs-spool_last_0']: >>>>> @operation_key=fs-spool_stop_0, @operation=stop, >>>>> @transition-key=77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @transition-magic=0:0;77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, >>>>> @call-id=1282, @last-run=1457613441, @last-rc-change=1457613441, >>>>> @exec-time=86 >>>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_request: Completed cib_modify operation for section >> status: >>>> OK >>>>> (rc=0, origin=mail1/crmd/200, 
version=0.197.22) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >>>>> match_graph_event: Action fs-spool_stop_0 (77) confirmed on mail1 >>>> (rc=0) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning: >>>>> run_graph: Transition 1236 (Complete=11, Pending=0, Fired=0, >>>> Skipped=0, >>>>> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-303.bz2): >>>>> Terminated >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning: >>>>> te_graph_trigger: Transition failed: terminated >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_graph: Graph 1236 with 12 actions: batch-limit=12 jobs, >>>>> network-delay=0ms >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 16]: Completed pseudo op >>>>> network-services_stopped_0 on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 15]: Completed pseudo op >>>>> network-services_stop_0 on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 12]: Completed rsc op >> virtualip-1_stop_0 >>>>> on mail1 (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 84]: Completed pseudo op >>>>> fs-services_stopped_0 on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 83]: Completed pseudo op >>>> fs-services_stop_0 >>>>> on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 77]: Completed rsc op fs-spool_stop_0 >>>>> on mail1 (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 79]: Completed rsc op fs-mail_stop_0 >>>>> on mail1 (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 90]: Completed pseudo op >>>>> mail-services_stopped_0 on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 89]: Completed pseudo op >>>>> mail-services_stop_0 on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 86]: Pending rsc op >> postfix_monitor_45000 >>>>> on mail2 (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: * [Input 85]: Unresolved dependency rsc op >>>>> postfix_start_0 on mail2 >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 5]: Completed rsc op postfix_stop_0 >>>>> on mail1 (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> print_synapse: [Action 8]: Completed pseudo op all_stopped >>>>> on N/A (priority: 0, waiting: none) >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: >> do_log: >>>>> FSA: Input I_TE_SUCCESS from notify_crmd() received in state >>>>> S_TRANSITION_ENGINE >>>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: >>>>> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ >>>>> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Mar 10 13:37:26 [7415] HWJ-626.domain.local cib: info: >>>>> cib_process_ping: Reporting our current digest to mail2: >>>>> 3896ee29cdb6ba128330b0ef6e41bd79 for 0.197.22 (0x1544a30 
0)
>>
>
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org