So far so bad.

I made a dummy OCF script [0] to simulate a promote/demote/notify failure mode for a multi-state clone resource, very similar to the one I reported originally. The test to reproduce my case with the dummy is:

- install the dummy OCF RA and create the dummy resource as the README [0] says
- watch a) the OCF logs from the dummy and b) the output of these recurring commands:
# while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1; sleep 20; done&
# crm_resource --resource p_dummy --list-operations

At some point I noticed:

- there are no more "OK" messages logged from the monitor actions, although according to the timestamps of the trace_ra dumps all monitors are still being invoked!

- very strange results reported by:

  # crm_resource --resource p_dummy --list-operations
  p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000 (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan 4 14:33:07 2016, exec=62107ms): Timed Out

  or

  p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000 (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4 14:43:58 2016, exec=0ms): Timed Out

- according to the trace_ra dumps, recurring monitors are being invoked at intervals *much longer* than configured. For example, 7 minutes of "monitoring silence":

  Mon Jan 4 14:47:46 UTC 2016
  p_dummy.monitor.2016-01-04.14:40:52
  Mon Jan 4 14:48:06 UTC 2016
  p_dummy.monitor.2016-01-04.14:47:58

Given that, it is very likely that there is a bug in how Pacemaker monitors multi-state clones!

[0] https://github.com/bogdando/dummy-ocf-ra

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
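
P.S. To spot the "monitoring silence" without eyeballing timestamps, something like the sketch below could report every gap between consecutive monitor trace files that exceeds a threshold. It is only a rough illustration: the function name monitor_gaps is made up here, it assumes trace file names of the form <resource>.monitor.YYYY-MM-DD.HH:MM:SS as in the listing above, and it relies on GNU date for parsing. The 103-second threshold is taken from the 103000 ms interval in the operation name p_dummy_monitor_103000.

```shell
# Hypothetical helper (not part of Pacemaker or the dummy RA).
# Reads trace file names on stdin, one per line, in chronological order,
# and prints every monitor run that started more than $1 seconds after
# the previous one.
monitor_gaps() (
    threshold=$1
    prev=""
    while read -r name; do
        # Only look at monitor traces; skip promote/demote/notify etc.
        case $name in
            *.monitor.*) ;;
            *) continue ;;
        esac
        stamp=${name#*.monitor.}                       # 2016-01-04.14:40:52
        stamp=$(printf '%s' "$stamp" | tr '.' ' ')     # 2016-01-04 14:40:52
        # Assumes GNU date; skip names whose timestamp does not parse.
        epoch=$(date -u -d "$stamp" +%s) || continue
        if [ -n "$prev" ]; then
            gap=$((epoch - prev))
            if [ "$gap" -gt "$threshold" ]; then
                echo "$name: ${gap}s since previous monitor"
            fi
        fi
        prev=$epoch
    done
)
```

It would be run as:

  # ls /var/lib/heartbeat/trace_ra/dummy/ | monitor_gaps 103

For the two trace files quoted above, that reports a gap of 426 seconds.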