On 01/04/2016 08:50 AM, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multistate clone resource which
> is very similar to the one I reported originally. And the test to
> reproduce my case with the dummy is:
> - install dummy resource ocf ra and create the dummy resource as README
> [0] says
> - just watch the a) OCF logs from the dummy and b) outputs for the
> reoccurring commands:
>
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
>
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!

Yes, that's to reduce log clutter / I/O (which especially matters when
you scale to hundreds of resources). As long as a recurring monitor is
OK, only the first OK is logged.

> - at some point I noticed very strange results reported by the:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4
> 14:43:58 2016, exec=0ms): Timed Out

Note that these are on different nodes. When pacemaker starts a
resource, it first "probes" all nodes by running a one-time monitor
operation on them, to ensure the service is not already running
somewhere. So those are expected to "fail".

Your dummy RA always returns OCF_SUCCESS for status/monitor, which will
cause problems. Pacemaker will think it's already running everywhere,
and not try to start it. A master/slave resource should use these return
codes:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents

> - according to the trace_ra dumps reoccurring monitors are being invoked
> by the intervals *much longer* than configured. For example, a 7 minutes
> of "monitoring silence":
> Mon Jan 4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan 4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
>
> Given that said, it is very likely there is some bug exist for
> monitoring multi-state clones in pacemaker!
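
To illustrate the return codes mentioned above, here is a minimal sketch
of a monitor action for a master/slave agent. The state directory, the
marker file names, and the function name dummy_monitor are hypothetical
and not taken from your RA [0]; only the numeric OCF codes are the
standard ones. A real agent would check the actual service instead of
marker files.

```shell
#!/bin/sh
# Standard OCF return codes used by a multi-state monitor action.
OCF_SUCCESS=0          # running (as slave, for a master/slave resource)
OCF_NOT_RUNNING=7      # cleanly stopped; the expected probe result
OCF_RUNNING_MASTER=8   # running in master mode

# ASSUMPTION: hypothetical location for marker files, for illustration only.
: "${DUMMY_STATE_DIR:=/tmp/p_dummy_state}"

dummy_monitor() {
    # No "running" marker: report the resource as stopped, so that a
    # probe on a node where it was never started does not claim success.
    if [ ! -f "$DUMMY_STATE_DIR/running" ]; then
        return $OCF_NOT_RUNNING
    fi
    # A "master" marker means the instance has been promoted.
    if [ -f "$DUMMY_STATE_DIR/master" ]; then
        return $OCF_RUNNING_MASTER
    fi
    # Otherwise the instance is running as a slave.
    return $OCF_SUCCESS
}
```

With something like this, the initial probe on a node where the resource
is stopped returns 7 rather than 0, so pacemaker does not conclude that
the resource is already active there.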
>
> [0] https://github.com/bogdando/dummy-ocf-ra