Andrei, Andrew, I'm afraid the machine is not available to me anymore, I'm sorry.

I reproduced the problem on my laptop. It seems that a colocated resource failing to start caused the other resources to be demoted and stopped, so this is most likely a non-issue. I'll keep monitoring it and let you know. Thank you!
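To illustrate what I mean, here is a minimal sketch of the constraint pattern that can produce this. The FAKE_FS resource, its parameters and the constraint names are illustrative only, not taken from my real CIB:

    # Hypothetical resource colocated with the master role. If it fails to
    # start on every node, the mandatory colocation rolls its scores back
    # into MS_POSTGRESQL (the "Rolling back scores" messages in the pengine
    # log), and the master/slave set ends up demoted or stopped with it.
    primitive FAKE_FS ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/lib/pgsql" fstype="ext4" \
        op monitor interval=20s
    colocation FAKE_FS_with_master inf: FAKE_FS MS_POSTGRESQL:Master
    order FAKE_FS_after_promote inf: MS_POSTGRESQL:promote FAKE_FS:start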
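Also, regarding the pingd location rule in the thread quoted below: besides tcpdump, the attribute value can be read back directly to double-check the rule's input. Something like this should show the current value (I believe the option spelling is right for 1.1.10, but check crm_attribute --help):

    # query the transient (status-section) pingd node attribute on clustera
    crm_attribute --node clustera --name pingd --lifetime reboot --query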
2014-10-28 5:24 GMT+01:00 Andrew Beekhof <and...@beekhof.net>:
>
>> On 24 Oct 2014, at 9:00 pm, Sékine Coulibaly <scoulib...@gmail.com> wrote:
>>
>> Hi Andrew,
>>
>> Yep, forgot the attachments. I did reproduce the issue, please find
>> the bz2 files attached. Please tell me if you need hb_report to be used.
>
> Yep, I need the log files to put these into context.
>
>> Thank you!
>>
>> 2014-10-07 5:07 GMT+02:00 Andrew Beekhof <and...@beekhof.net>:
>>> I think you forgot the attachments (and my eyes are going blind trying to
>>> read the word-wrapped logs :-)
>>>
>>> On 26 Sep 2014, at 6:37 pm, Sékine Coulibaly <scoulib...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying my best to diagnose a strange behaviour of my cluster.
>>>>
>>>> My cluster is basically a master/slave PostgreSQL cluster with a VIP,
>>>> running on two nodes (clustera and clusterb). I'm on RHEL 6.5, Corosync
>>>> 1.4.1-1 and Pacemaker 1.1.10.
>>>>
>>>> For the sake of a simpler diagnosis, I took the slave node out.
>>>>
>>>> My problem is that the cluster promotes the POSTGRESQL resource properly
>>>> once (I issue "crm resource cleanup MS_POSTGRESQL" to reset the failcount,
>>>> and then all resources are started on clustera), but after a Corosync
>>>> restart the POSTGRESQL resource is not promoted.
>>>>
>>>> I narrowed it down to the following location constraint (without it,
>>>> POSTGRESQL is promoted again after a Corosync restart):
>>>>
>>>> location VIP_MGT_needs_gw VIP_MGT rule -inf: not_defined pingd or pingd lte 0
>>>>
>>>> The logs show that the pingd attribute value is 1000 (the ping target is
>>>> reachable and is actually being pinged; I checked with tcpdump). The
>>>> attribute is set by:
>>>>
>>>> primitive ping_eth1_mgt_gw ocf:pacemaker:ping \
>>>>     params host_list=178.3.1.47 multiplier=1000 \
>>>>     op monitor interval=10s \
>>>>     meta migration-threshold=3
>>>>
>>>> From corosync.log I can see:
>>>>
>>>> Sep 26 09:49:36 [22188] clustera pengine: notice: LogActions: Start POSTGRESQL:0 (clustera)
>>>> Sep 26 09:49:36 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:1 (Stopped)
>>>> [...]
>>>> Sep 26 09:49:36 [22186] clustera lrmd: info: log_execute: executing - rsc:POSTGRESQL action:start call_id:20
>>>> [...]
>>>> Sep 26 09:49:37 [22187] clustera attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: master-POSTGRESQL (50)
>>>> [...]
>>>> Sep 26 09:49:37 [22189] clustera crmd: info: match_graph_event: Action POSTGRESQL_notify_0 (46) confirmed on clustera (rc=0)
>>>> [...]
>>>> Sep 26 09:49:38 [22186] clustera lrmd: info: log_finished: finished - rsc:ping_eth1_mgt_gw action:start call_id:22 pid:22352 exit-code:0 exec-time:2175ms queue-time:0ms
>>>> [...]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: clone_print: Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Slaves: [ clustera ]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Stopped: [ clusterb ]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: native_print: VIP_MGT (ocf::heartbeat:IPaddr2): Stopped
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: clone_print: Clone Set: cloned_ping_eth1_mgt_gw [ping_eth1_mgt_gw]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Started: [ clustera ]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: short_print: Stopped: [ clusterb ]
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: rsc_merge_weights: VIP_MGT: Rolling back scores from MS_POSTGRESQL
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource VIP_MGT cannot run anywhere
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: native_color: POSTGRESQL:1: Rolling back scores from VIP_MGT
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource POSTGRESQL:1 cannot run anywhere
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: master_color: MS_POSTGRESQL: Promoted 0 instances of a possible 1 to master
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: native_color: Resource ping_eth1_mgt_gw:1 cannot run anywhere
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (60s) for POSTGRESQL:0 on clustera
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (60s) for POSTGRESQL:0 on clustera
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp: Start recurring monitor (10s) for ping_eth1_mgt_gw:0 on clustera
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:0 (Slave clustera)
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave POSTGRESQL:1 (Stopped)
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave VIP_MGT (Stopped)
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave ping_eth1_mgt_gw:0 (Started clustera)
>>>> Sep 26 09:49:38 [22188] clustera pengine: info: LogActions: Leave ping_eth1_mgt_gw:1 (Stopped)
>>>> [...]
>>>>
>>>> Then everything goes weird. POSTGRESQL is monitored and found running as
>>>> master (rc=8, OCF_RUNNING_MASTER), but the cluster apparently expected
>>>> rc=0 (OCF_SUCCESS, i.e. running as a slave), so the monitor operation is
>>>> recorded as failed. That's weird, isn't it? Or am I missing something?
>>>> Here is that part of the log:
>>>>
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: do_te_invoke: Processing graph 1 (ref=pe_calc-dc-1411717778-16) derived from /var/lib/pacemaker/pengine/pe-input-1057.bz2
>>>> Sep 26 09:49:38 [22189] clustera crmd: notice: te_rsc_command: Initiating action 6: monitor POSTGRESQL_monitor_60000 on clustera (local)
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: do_lrm_rsc_op: Performing key=6:1:0:613bb344-94ce-4210-879c-6d6e90f35871 op=POSTGRESQL_monitor_60000
>>>> Sep 26 09:49:38 [22189] clustera crmd: notice: te_rsc_command: Initiating action 33: monitor ping_eth1_mgt_gw_monitor_10000 on clustera (local)
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: do_lrm_rsc_op: Performing key=33:1:0:613bb344-94ce-4210-879c-6d6e90f35871 op=ping_eth1_mgt_gw_monitor_10000
>>>> Sep 26 09:49:38 [22184] clustera cib: info: crm_client_new: Connecting 0x1bbcc00 for uid=0 gid=0 pid=22767 id=0ef393a5-6f7b-4893-9b06-0fab26d9a06d
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_query operation for section nodes: OK (rc=0, origin=local/crm_attribute/2, version=0.692.9)
>>>> Sep 26 09:49:38 [22187] clustera attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: master-POSTGRESQL (100)
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='master-POSTGRESQL']: OK (rc=0, origin=local/attrd/8, version=0.692.9)
>>>> Sep 26 09:49:38 [22184] clustera cib: info: crm_client_destroy: Destroying 0 events
>>>> Sep 26 09:49:38 [22187] clustera attrd: notice: attrd_perform_update: Sent update 9: master-POSTGRESQL=100
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/9, version=0.692.10)
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: abort_transition_graph: te_update_diff:172 - Triggered transition abort (complete=0, node=clustera, tag=nvpair, id=status-clustera-master-POSTGRESQL, name=master-POSTGRESQL, value=100, magic=NA, cib=0.692.10) : Transient attribute: update
>>>> Sep 26 09:49:38 [22189] clustera crmd: notice: process_lrm_event: LRM operation POSTGRESQL_monitor_60000 (call=29, rc=8, cib-update=33, confirmed=false) master
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/33, version=0.692.11)
>>>> Sep 26 09:49:38 [22189] clustera crmd: warning: status_from_rc: Action 6 (POSTGRESQL_monitor_60000) on clustera failed (target: 0 vs. rc: 8): Error
>>>> Sep 26 09:49:38 [22189] clustera crmd: warning: update_failcount: Updating failcount for POSTGRESQL on clustera after failed monitor: rc=8 (update=value++, time=1411717778)
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: abort_transition_graph: match_graph_event:313 - Triggered transition abort (complete=0, node=clustera, tag=lrm_rsc_op, id=POSTGRESQL_last_failure_0, magic=0:8;6:1:0:613bb344-94ce-4210-879c-6d6e90f35871, cib=0.692.11) : Event failed
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: match_graph_event: Action POSTGRESQL_monitor_60000 (6) confirmed on clustera (rc=4)
>>>> Sep 26 09:49:38 [22189] clustera crmd: warning: update_failcount: Updating failcount for POSTGRESQL on clustera after failed monitor: rc=8 (update=value++, time=1411717778)
>>>> Sep 26 09:49:38 [22189] clustera crmd: info: process_graph_event: Detected action (1.6) POSTGRESQL_monitor_60000.29=master: failed
>>>> Sep 26 09:49:38 [22187] clustera attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-POSTGRESQL (1)
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='fail-count-POSTGRESQL']: No such device or address (rc=-6, origin=local/attrd/10, version=0.692.11)
>>>> Sep 26 09:49:38 [22184] clustera cib: info: cib_process_request: Completed cib_query operation for section /cib: OK (rc=0, origin=local/attrd/11, version=0.692.11)
>>>> Sep 26 09:49:38 [22187] clustera attrd: notice: attrd_perform_update: Sent update 12: fail-count-POSTGRESQL=1
>>>>
>>>> Attached are the bz2 files and the corosync.log.
>>>>
>>>> Any help is appreciated!
>>>>
>>>> Sekine
>>
>> [16 attachments: pe-input-208.bz2 through pe-input-223.bz2]

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org