Hi Thierry!

Having a glance at the log, I wonder:

* Why is the start for pgsql_mail returning an "unknown error (1)"?
* Why is the demote for drbd_pgsql:1 returning an "unknown error (1)"?
* Your DC (dvs47713) went offline
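To see what the cluster recorded for those failed operations, a quick, hedged example using the standard Pacemaker CLI tools (the resource names are taken from your log; adjust as needed):

    # One-shot status, with fail counts (-f) and operation history (-o):
    crm_mon -1 -f -o

    # Once the cause is fixed, clear the failure history so the cluster
    # stops acting on the stale failed start/demote:
    crm_resource --cleanup --resource pgsql_mail
    crm_resource --cleanup --resource drbd_pgsql

The resource agent's own output usually explains what "unknown error (1)" actually was; look in the corosync/pacemaker log right around the failed operation.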
So the first action plan is:

Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:1 (Stopped)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_vip (Started dvs42832)

(BTW: You may want to limit the number of policy files kept; see the example below.)

As the cluster goes to IDLE mode right after this (the same "leave everything" transition summary repeats), I must assume that you have no fencing configured. The cluster seems unable to react until:

[22170] dvs42832 corosync notice [MAIN ] Completed service synchronization, ready to provide service.

As the DC was not fenced, you have two of them:

Mar 09 09:26:22 [22179] dvs42832 crmd: warning: crmd_ha_msg_filter: Another DC detected: dvs47713 (op=noop)

(Re-joining after a split brain is risky.) After the rejoin, the cluster handles the failure:

Mar 09 09:26:24 [22178] dvs42832 pengine: warning: unpack_rsc_op_failure: Forcing drbd_pgsql:1 to stop after a failed demote action

So the next action plan is:

Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Demote drbd_pgsql:1 (Master -> Slave dvs47713)
Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_fs (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_lsb (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_vip (Started dvs42832)

Then:

Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Stop drbd_pgsql:1 (dvs47713)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_lsb (dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_vip (dvs42832)

Then:

Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start drbd_pgsql:1 (dvs47713)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_vip (dvs42832)
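Regarding the policy-file remark above: a hedged sketch of capping the number of saved PE input files via cluster properties (crmsh syntax; the property names exist in Pacemaker 1.1 and 2.x, the values are just examples):

    # Keep at most 100 saved scheduler inputs of each series:
    crm configure property pe-input-series-max=100
    crm configure property pe-error-series-max=100
    crm configure property pe-warn-series-max=100

With pcs the equivalent would be `pcs property set pe-input-series-max=100` and so on.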
Eventually:

Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:1 (Slave dvs47713)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_vip (Started dvs42832)

So it seems you have three problems:

1) some resource operation failing
2) network problems
3) no fencing configured

Just adjusting some timeouts won't help much in this situation; sketches for points 2 and 3 follow below.
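For problem 3, a minimal, hedged sketch of what enabling fencing could look like, assuming your nodes have IPMI-capable management boards (the agent, addresses and credentials below are placeholders, not taken from your setup):

    # crmsh syntax; fence_ipmilan comes from the fence-agents package.
    # Parameter names (ip/username/password vs. ipaddr/login/passwd)
    # depend on the fence-agents version, so check "stonith_admin -M".
    crm configure primitive fence_dvs42832 stonith:fence_ipmilan \
        params pcmk_host_list=dvs42832 ip=192.0.2.10 username=admin password=secret \
        op monitor interval=60s
    crm configure primitive fence_dvs47713 stonith:fence_ipmilan \
        params pcmk_host_list=dvs47713 ip=192.0.2.11 username=admin password=secret \
        op monitor interval=60s
    # Never run a fence device on the node it is meant to kill:
    crm configure location l_fence_dvs42832 fence_dvs42832 -inf: dvs42832
    crm configure location l_fence_dvs47713 fence_dvs47713 -inf: dvs47713
    crm configure property stonith-enabled=true

For problem 2, if you still want the cluster to tolerate short network blips, the knob is the corosync token timeout rather than any Pacemaker resource timeout, e.g. in /etc/corosync/corosync.conf (the value is an example; defaults differ between corosync versions):

    totem {
        # Milliseconds the token may be lost before corosync
        # declares a membership change (default is much lower):
        token: 5000
    }

But note: without fencing a longer token only delays the split brain, it does not prevent it.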
Regards,
Ulrich

>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 18:24 in message <pr2p264mb076785e16fad8f054d972c0ef5...@pr2p264mb0767.frap264.prod.outlook.com>:
> Here is an extract of "corosync.log"...
>
> Thierry
>
> ________________________________
> From: Users <users-boun...@clusterlabs.org> on behalf of Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
> Sent: Wednesday, March 9, 2022 17:13
> To: users@clusterlabs.org <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Cluster timeout
>
>>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 16:56 in message <pr2p264mb07678dbb0517cb8c7695627cf5...@pr2p264mb0767.frap264.prod.outlook.com>:
>
>>>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 11:46 in message <pr2p264mb076751671fc57f33b995f851f5...@pr2p264mb0767.frap264.prod.outlook.com>:
>>
>>> Hi,
>>>
>>> I manage an active/passive PostgreSQL cluster using DRBD, LVM, Pacemaker and Corosync on a Debian GNU/Linux operating system.
>>> Everything is OK, but my platform seems to be quite "sensitive" to small network timeouts which are generating a cluster migration start from active to passive node; generally, the process doesn't go through to the end: as soon as the connection is back again, the migration is cancelled and the database restarts!
>>
>> Could it be you run without fencing? Maybe show some logs!
>>
>> Logs are quite verbose and not very easy to understand...
>> What log would you need?
>
> Those showing what happens when the network goes down, and what happens when the network comes up.
> Usually the DC writes some good "action summaries" (typically after "pacemaker-controld[7236]: notice: State transition S_IDLE -> S_POLICY_ENGINE"). Those would be helpful.
>
>>> That should be OK but on the application side, some database connections (on a Java WildFly server) can become "invalid"! So I would like to avoid these migrations when this kind of small timeout occurs...
>>>
>>> So my question is: which cluster settings can I change to increase the timeout before starting a cluster migration?
>>>
>>> Best regards,
>>> Thierry
>>>
>>> Thierry Florac
>>> Resp. Pôle Architecture Applicative et Mobile
>>> DSI ‑ Dépt. Études et Solutions Transverses
>>> 2, avenue de Saint‑Mandé ‑ 75570 Paris cedex 12
>>> Tél : 01 40 19 59 64
>>> www.onf.fr <https://www.onf.fr>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/