Re: [ClusterLabs] [EXT] Problem with DLM

2022-07-26 Thread Reid Wahl
On Tue, Jul 26, 2022 at 12:36 PM Lentes, Bernd wrote:
>
>
>
> - On 26 Jul, 2022, at 20:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>
> > Hi Bernd!
> >
> > I think the answer may be some time before the timeout was reported; maybe a
> > network issue? Or a very high load. It's hard to say from the logs...
>
> Yes, I had a high load before:
> Jul 20 00:17:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 90.080002
> Jul 20 00:18:12 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 76.169998
> Jul 20 00:18:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 85.629997
> Jul 20 00:19:12 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 70.660004
> Jul 20 00:19:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 58.34
> Jul 20 00:20:12 [32512] ha-idg-1 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 48.740002
> Jul 20 00:20:12 [32512] ha-idg-1 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0100)
> Jul 20 00:20:42 [32512] ha-idg-1 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 41.88
> Jul 20 00:21:12 [32512] ha-idg-1 crmd: info: throttle_send_command: New throttle mode: 0001 (was 0010)
> Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: child_timeout_callback: dlm_monitor_3 process (PID 11816) timed out
> Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: operation_finished: dlm_monitor_3:11816 - timed out after 2ms
> Jul 20 00:21:56 [32512] ha-idg-1 crmd: error: process_lrm_event: Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_3 timeout=2ms
> Jul 20 00:21:56 [32512] ha-idg-1 crmd: info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de
> Jul 20 00:21:56 [12204] ha-idg-1 lrmd: info: process_lrmd_alert_exec: Executing alert smtp_alert for 8f934e90-12f5-4bad-b4f4-55ac933f01c6
>
> Can that interfere with DLM?

High load can potentially interfere with just about any process,
including the monitor operation of the ocf:pacemaker:controld resource
agent (which is what timed out) or any of its child processes. High
load can be caused by storage latency, overworking the system, or
other assorted factors.
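If the spikes are transient, one common mitigation is to give the monitor operation more headroom so it can ride them out. A hedged sketch in crmsh syntax (the resource name "dlm" and agent are taken from the logs above; the interval and timeout values are illustrative, not your cluster's actual settings):

```
# crmsh resource definition sketch -- values are illustrative only
primitive dlm ocf:pacemaker:controld \
    op monitor interval=30s timeout=90s \
    op start interval=0 timeout=90s \
    op stop interval=0 timeout=100s
```

A longer monitor timeout only papers over the symptom, of course; sustained load or storage latency still needs to be addressed on its own.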

And as Ulrich correctly noted, the kernel messages occur after the
monitor timeout. They were probably an expected part of the cluster's
attempt to recover the resource.
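For spotting this pattern after the fact, the throttle messages can be pulled out of the log mechanically. A minimal sketch (Python assumed; the excerpt is hard-coded from the lines quoted above with the wrapping re-joined, and `load_samples` is a hypothetical helper, not part of any Pacemaker tooling):

```python
import re

# Excerpt of the crmd throttle messages quoted above (wrapped lines re-joined).
LOG = """\
Jul 20 00:17:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 90.080002
Jul 20 00:18:12 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 76.169998
Jul 20 00:19:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 58.34
Jul 20 00:20:12 [32512] ha-idg-1 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 48.740002
"""

# One capture group each for timestamp, severity, and the load value.
PATTERN = re.compile(r"^(\w+ \d+ [\d:]+).*(High|Moderate) CPU load detected: ([\d.]+)$")

def load_samples(text):
    """Return one (timestamp, severity, load) tuple per matching log line."""
    return [(m.group(1), m.group(2), float(m.group(3)))
            for m in (PATTERN.match(line) for line in text.splitlines()) if m]

samples = load_samples(LOG)
for ts, severity, load in samples:
    print(f"{ts}  {severity:<8} {load:.2f}")
print(f"peak load before the timeout: {max(s[2] for s in samples):.2f}")
```

Plotting or tabulating the same tuples against the timestamps of operation timeouts makes the correlation between load spikes and monitor failures easy to see.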

>
> Bernd



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [EXT] Problem with DLM

2022-07-26 Thread Lentes, Bernd


- On 26 Jul, 2022, at 20:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> Hi Bernd!
> 
> I think the answer may be some time before the timeout was reported; maybe a
> network issue? Or a very high load. It's hard to say from the logs...

Yes, I had a high load before:
Jul 20 00:17:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 90.080002
Jul 20 00:18:12 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 76.169998
Jul 20 00:18:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 85.629997
Jul 20 00:19:12 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 70.660004
Jul 20 00:19:42 [32512] ha-idg-1 crmd: notice: throttle_check_thresholds: High CPU load detected: 58.34
Jul 20 00:20:12 [32512] ha-idg-1 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 48.740002
Jul 20 00:20:12 [32512] ha-idg-1 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0100)
Jul 20 00:20:42 [32512] ha-idg-1 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 41.88
Jul 20 00:21:12 [32512] ha-idg-1 crmd: info: throttle_send_command: New throttle mode: 0001 (was 0010)
Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: child_timeout_callback: dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: operation_finished: dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1 crmd: error: process_lrm_event: Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1 crmd: info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de
Jul 20 00:21:56 [12204] ha-idg-1 lrmd: info: process_lrmd_alert_exec: Executing alert smtp_alert for 8f934e90-12f5-4bad-b4f4-55ac933f01c6

Can that interfere with DLM?

Bernd



Re: [ClusterLabs] [EXT] Problem with DLM

2022-07-26 Thread Ulrich Windl
Hi Bernd!

I think the answer may be some time before the timeout was reported; maybe a
network issue? Or a very high load. It's hard to say from the logs...


>>> On 26.07.2022 at 15:32, in message <6ABA7762.4E4 : 205 : 62692>, "Lentes, Bernd" wrote:
Hi,

it seems my DLM went crazy:

/var/log/cluster/corosync.log:
Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: child_timeout_callback: dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1 lrmd: warning: operation_finished: dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1 crmd: error: process_lrm_event: Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1 crmd: info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de

/var/log/messages:
2022-07-20T00:21:56.644677+02:00 ha-idg-1 Cluster: alert_smtp.sh
2022-07-20T00:22:16.076936+02:00 ha-idg-1 kernel: [2366794.757496] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: leaving the lockspace group...
2022-07-20T00:22:16.364971+02:00 ha-idg-1 kernel: [2366795.045657] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: group event done 0 0
2022-07-20T00:22:16.364982+02:00 ha-idg-1 kernel: [2366795.045777] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: release_lockspace final free
2022-07-20T00:22:15.533571+02:00 ha-idg-1 Cluster: message repeated 22 times: [ alert_smtp.sh]
2022-07-20T00:22:17.164442+02:00 ha-idg-1 ocfs2_hb_ctl[19106]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u FD5D3C7CE9104CF5916A84DA0DBED302
2022-07-20T00:22:18.904936+02:00 ha-idg-1 kernel: [2366797.586278] ocfs2: Unmounting device (254,24) on (node 1084777482)
2022-07-20T00:22:19.116701+02:00 ha-idg-1 Cluster: alert_smtp.sh

What do these kernel messages mean? Why did DLM stop? I think this is the second time this has happened. It is a real show-stopper, because the node is fenced some minutes later:
00:34:40.709002 ha-idg: Fencing Operation Off of ha-idg-1 by ha-idg-2 for crmd.28253@ha-idg-2: OK (ref=9710f0e2-a9a9-42c3-a294-ed0bd78bba1a)

What can I do? Is there an alternative to DLM?
The system is SLES 12 SP5. Should I update to SLES 15 SP3?

Bernd



--
Bernd Lentes
System Administrator
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
+49 89 3187 49123
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd

