Hi all,

As Ken said:

"Not currently, but that is planned for a future version",

just a reminder of how useful it would be to have "ignore X monitoring timeouts" as an option in the newest Pacemaker.

I'm still having big problems with resources restarting because of lost monitoring requests, which leads to service interruptions.

Best regards,

Klecho


On 1.09.2017 17:52, Klechomir wrote:
On 1.09.2017 17:21, Jan Pokorný wrote:
On 01/09/17 09:48 +0300, Klechomir wrote:
I have cases where, for an unknown reason, a single monitoring request
never returns a result.
So having bigger timeouts doesn't resolve this problem.
If I get you right, the pain point here is a command called by the
resource agents during monitor operation, while this command under
some circumstances _never_ terminates (for dead waiting, infinite
loop, or whatever other reason) or possibly terminates based on
external/asynchronous triggers (e.g. network connection gets
reestablished).

Stating the obvious, the solution should be:
- work towards fixing such a particular command if blocking
   is an unexpected behaviour (clarify this with upstream
   if needed)
- find a more reliable way for the agent to monitor the resource
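For the "never terminates" failure mode, one common defensive pattern is to bound the status probe inside the agent itself. A minimal sketch, assuming GNU coreutils `timeout` is available; `sleep 5` stands in for a hypothetical status command that sometimes never returns:

```shell
#!/bin/sh
# Guard a monitor probe so it can never hang forever.
# `sleep 5` is a stand-in for the real (possibly hanging) status command.
timeout --kill-after=2s 1s sleep 5
rc=$?
if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when the time limit was reached
    echo "monitor probe timed out"
else
    echo "monitor probe returned rc=$rc"
fi
```

This way the agent regains control and can decide how to report the stall, instead of the lrmd timeout firing and Pacemaker treating it as a hard monitor failure.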

For the planned soft-recovery options Ken talked about, I am not
sure if it would be trivially possible to differentiate an exceeded
monitor timeout from a plain monitor failure.
In any case, there is currently no differentiation between a failed monitoring request and a timeout, so a parameter for ignoring X failures in a row would be very welcome for me.
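Until such an option exists, the closest knobs Pacemaker already offers are resource meta-attributes and the monitor operation's on-fail policy. A sketch (using the p_PingD resource name from the example log; thresholds are illustrative):

```shell
# Tolerate up to 3 failures before the resource is moved away, and
# forget recorded failures after 60 seconds of clean operation:
pcs resource meta p_PingD migration-threshold=3 failure-timeout=60s

# Or ignore monitor failures for this operation entirely.
# Risky: this also masks genuine outages, not just lost probes.
pcs resource update p_PingD op monitor interval=10s on-fail=ignore
```

Note that neither of these distinguishes a timed-out monitor from a genuinely failed one, which is exactly the gap being discussed.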

Here is one very fresh example, entirely unrelated to LV&I/O:
Aug 30 10:44:19 [1686093] CLUSTER-1 crmd: error: process_lrm_event: LRM operation p_PingD_monitor_0 (1148) Timed Out (timeout=20000ms)
Aug 30 10:44:56 [1686093] CLUSTER-1 crmd: notice: process_lrm_event: LRM operation p_PingD_stop_0 (call=1234, rc=0, cib-update=40, confirmed=true) ok
Aug 30 10:45:26 [1686093] CLUSTER-1 crmd: notice: process_lrm_event: LRM operation p_PingD_start_0 (call=1240, rc=0, cib-update=41, confirmed=true) ok

In this case PingD is fencing drbd and causes an unneeded restart of all related resources (unneeded, because the next monitoring request is ok).


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



--
Klecho

