31.12.2015 15:33:45 CET, Bogdan Dobrelya <bdobre...@mirantis.com> wrote:
>On 31.12.2015 14:48, Vladislav Bogdanov wrote:
>> blackbox tracing inside pacemaker, USR1, USR2 and TRAP signals iirc,
>> quick google search should point you to Andrew's blog with all
>> information about that feature.
>> Next, if you use ocf-shellfuncs in your RA, you could enable tracing
>> for resource itself, just add 'trace_ra=1' to every operation config
>> (start and monitor).
>
>Thank you, I will try to play with these things once I have the issue
>reproduced again. Cannot provide CIB as I don't have the env now.
>
>But still let me ask again, do anyone know or heard of anything like
>known/fixed bugs about corosync with pacemaker stop running monitor
>actions for a resource at some point, while notifications are still
>logged?
>
>Here is example:
>node-16 crmd:
>2015-12-29T13:16:49.113679+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_27000: unknown error (node=node-16.test.domain.local, call=254, rc=1, cib-update=1454, confirmed=false)
>node-17:
>2015-12-29T13:16:57.603834+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: unknown error (node=node-17.test.domain.local, call=181, rc=1, cib-update=297, confirmed=false)
>node-18:
>2015-12-29T13:20:16.870619+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-18.test.domain.local, call=187, rc=7, cib-update=306, confirmed=false)
>node-20:
>2015-12-29T13:20:51.486219+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-20.test.domain.local, call=180, rc=7, cib-update=308, confirmed=false)
>
>after that point only notifications got logged for affected nodes, like
>Operation p_rabbitmq-server_notify_0: ok (node=node-20.test.domain.local, call=287, rc=0, cib-update=0, confirmed=true)
>
>While the node-19 was not affected, and actions
>monitor/stop/start/notify logged OK all the time, like:
>2015-12-29T14:30:00.973561+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-19.test.domain.local, call=423, rc=7, cib-update=438, confirmed=false)
>2015-12-29T14:30:01.631609+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-19.test.domain.local, call=424, rc=0, cib-update=0, confirmed=true)
>2015-12-29T14:31:19.084165+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local, call=427, rc=0, cib-update=439, confirmed=true)
>2015-12-29T14:32:53.120157+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_start_0: unknown error (node=node-19.test.domain.local, call=428, rc=1, cib-update=441, confirmed=true)

Well, "not running" and "not logged" are not the same thing. I do not have access to the code right now, but I'm pretty sure that successful recurring monitors are not logged after the first run. Enabling trace_ra for the monitor op should prove that. If the trace shows the monitor is not actually being run, then it should be a bug. I recall something was fixed in that area recently.

Best,
Vladislav
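
P.S. To compare what each node last logged per operation, something like the following could help. It is only a rough sketch: the regular expression is an assumption based on the crmd excerpts quoted earlier in this thread, it expects unwrapped log lines (one event per line) in chronological order, and the function name last_events is made up for illustration. It only shows what was logged, so it cannot tell on its own whether a monitor stopped running or simply stopped being logged.

```python
import re
import sys

# Assumed line shape, copied from the process_lrm_event excerpts in this
# thread. Other pacemaker versions may format these lines differently.
EVENT_RE = re.compile(
    r"^(?P<ts>\S+) .*?process_lrm_event:\s+"
    r"Operation (?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+): "
    r"(?P<status>.+?) \(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+)"
)


def last_events(lines):
    """Map (node, op) to (timestamp, rc, status) of the last matching line."""
    latest = {}
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue  # not a process_lrm_event line, or wrapped/garbled
        latest[(m["node"], m["op"])] = (m["ts"], int(m["rc"]), m["status"])
    return latest


if __name__ == "__main__":
    # Feed the crmd log on stdin; the output has one line per node/operation.
    for (node, op), (ts, rc, status) in sorted(last_events(sys.stdin).items()):
        print(f"{node} {op}: last logged {ts} rc={rc} ({status})")
```

If the last monitor entry on a node is much older than its notify entries, that matches the symptom described above, and trace_ra on the monitor op is then the way to see whether the agent is still being called.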