Re: [ClusterLabs] Strange monitor return code log for LSB resource

Jan Pokorný Thu, 04 Jul 2019 10:41:26 -0700

Harvey,

On 25/06/19 21:26 +0000, Harvey Shepherd wrote:
> Thanks for your reply Andrei. There is no external monitoring
> software running. The logs that I posted are from the pacemaker log
> with debug enabled.


that's exactly what's questionable here -- how can crm_resource
invocation be at all correlated to the standard operation of pacemaker
cluster if you claim you don't have a clue what else (such as external
monitoring built upon crm_resource or, in extension upon crm/pcs)
could be triggering it.  That's where Andrei was zeroing in.

Anyway, to spare you possibly heavyhanded inspection of circumstances
in your cluster, I'd suggest you install system and run this command
as root:

  stap /usr/share/systemtap/examples/process/pstrace_exec.stp /logging

where pstrace_exec.stp can also be obtained from
https://sourceware.org/systemtap/examples/process/pstrace_exec.stp 
if your installation (systemtap-client package in my case) doesn't
carry it.

During the time, you'll likely observe occurrences where "logging"
script is run directly by pacemaker-execd, plus some process chains
finishing with crm_resource and these shall suggest the cause of
why that is being run at all (where it likely should not, see also
the remark on re-entrancy below).

> The Linux hostname of the node is "ctr_qemu", but the Pacemaker name
> is "primary". It is actually the same node.
> 
> I've just run the same test again and interestingly this time the
> source of the monitor logs seems to be shared between "crm_resource"
> and "pacemaker-schedulerd"
> 
> Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (recurring_action_timer)  
>   debug: Scheduling another invocation of logging_status_10000
> Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (operation_finished)      
>   debug: logging_status_10000:1181 - exited with rc=0
> Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (operation_finished)      
>   debug: logging_status_10000:1181:stdout [ Remote syslog service is running ]
> Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (log_finished)      
> debug: finished - rsc:logging action:monitor call_id:127 pid:1181 exit-code:0 
> exec-time:0ms queue-time:0ms
> Jun 25 21:18:35 ctr_qemu crm_resource        [1268] (determine_op_status)     
>   debug: logging_monitor_10000 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:35 ctr_qemu crm_resource        [1268] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (determine_op_status)     
>   debug: logging_monitor_10000 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_print)      info: 
> logging   (lsb:logging):      Started primary
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_apply_stickiness) 
>   debug: Resource logging: preferring current location (node=primary, 
> weight=1)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (native_assign_node)      
>   debug: Assigning primary to logging
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (LogActions)        info: 
> Leave   logging   (Started primary)

another thing, actually:

> 25.06.2019 16:53, Harvey Shepherd пишет:
>> Pacemaker is reporting failed resource actions, but fail-count is not 
>> incremented for the resource:
>> 
>> Migration Summary:
>> * Node primary:
>> * Node secondary:
>> 
>> Failed Resource Actions:
>> * logging_monitor_10000 on primary 'not running' (7): call=119, 
>> status=complete, exitreason='',
>>     last-rc-change='Tue Jun 25 13:13:12 2019', queued=0ms, exec=0ms

Gotta wonder this can only be caused with what _exclusively_
pacemaker-execd observed (but that could be a result of interference
with crm_resource initiated operation ... with --force-check parameter,
or what else?), sadly, the log didn't contain the referenced time
of "last-rc-change".

Finally, speaking of interferences, this topic has been brought up
recently, regarding re-entrancy in particular:

https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html

Apparently, axioms of agent/resource execution that apply with pacemaker
alone won't hold when there are multiple actors as it seems to be your
case (pacemaker vs. isolated crm_resource tool based trigger).


Hope this helps

-- 
Jan (Poki)

pgpWwEAgXpABx.pgp
Description: PGP signature

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Strange monitor return code log for LSB resource

Reply via email to