Hi Thierry!

Having a glance at the log, I wonder:

* Why is the start for pgsql_mail returning an "unknown error (1)"?
* Why is the demote for drbd_pgsql:1 returning an "unknown error (1)"?
* Your DC (dvs47713) went offline
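To see what the cluster recorded for those failed operations, a quick, hedged example using the standard Pacemaker CLI tools (the resource names are taken from your log; adjust as needed):

    # One-shot status, with fail counts (-f) and operation history (-o):
    crm_mon -1 -f -o

    # Once the cause is fixed, clear the failure history so the cluster
    # stops acting on the stale failed start/demote:
    crm_resource --cleanup --resource pgsql_mail
    crm_resource --cleanup --resource drbd_pgsql

The resource agent's own output usually explains what "unknown error (1)" actually was; look in the corosync/pacemaker log right around the failed operation.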
So the first action plan is:

Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:1 (Stopped)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_vip (Started dvs42832)

(BTW: You may want to limit the number of policy files kept; see the example below.)

As the cluster goes to IDLE mode right after this (the same "leave everything" transition summary repeats), I must assume that you have no fencing configured. The cluster seems unable to react until:

[22170] dvs42832 corosync notice [MAIN ] Completed service synchronization, ready to provide service.

As the DC was not fenced, you have two of them:

Mar 09 09:26:22 [22179] dvs42832 crmd: warning: crmd_ha_msg_filter: Another DC detected: dvs47713 (op=noop)

(Re-joining after a split brain is risky.) After the rejoin, the cluster handles the failure:

Mar 09 09:26:24 [22178] dvs42832 pengine: warning: unpack_rsc_op_failure: Forcing drbd_pgsql:1 to stop after a failed demote action

So the next action plan is:

Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Demote drbd_pgsql:1 (Master -> Slave dvs47713)
Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_fs (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_lsb (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart pgsql_vip (Started dvs42832)

Then:

Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Stop drbd_pgsql:1 (dvs47713)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_lsb (dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_vip (dvs42832)

Then:

Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start drbd_pgsql:1 (dvs47713)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start pgsql_vip (dvs42832)
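Regarding the policy-file remark above: a hedged sketch of capping the number of saved PE input files via cluster properties (crmsh syntax; the property names exist in Pacemaker 1.1 and 2.x, the values are just examples):

    # Keep at most 100 saved scheduler inputs of each series:
    crm configure property pe-input-series-max=100
    crm configure property pe-error-series-max=100
    crm configure property pe-warn-series-max=100

With pcs the equivalent would be `pcs property set pe-input-series-max=100` and so on.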
Eventually:

Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave drbd_pgsql:1 (Slave dvs47713)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_mail (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_fs (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_lsb (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave pgsql_vip (Started dvs42832)

So it seems you have three problems:

1) some resource operation failing
2) network problems
3) no fencing configured

Just adjusting some timeouts won't help much in this situation; sketches for points 2 and 3 follow below.
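For problem 3, a minimal, hedged sketch of what enabling fencing could look like, assuming your nodes have IPMI-capable management boards (the agent, addresses and credentials below are placeholders, not taken from your setup):

    # crmsh syntax; fence_ipmilan comes from the fence-agents package.
    # Parameter names (ip/username/password vs. ipaddr/login/passwd)
    # depend on the fence-agents version, so check "stonith_admin -M".
    crm configure primitive fence_dvs42832 stonith:fence_ipmilan \
        params pcmk_host_list=dvs42832 ip=192.0.2.10 username=admin password=secret \
        op monitor interval=60s
    crm configure primitive fence_dvs47713 stonith:fence_ipmilan \
        params pcmk_host_list=dvs47713 ip=192.0.2.11 username=admin password=secret \
        op monitor interval=60s
    # Never run a fence device on the node it is meant to kill:
    crm configure location l_fence_dvs42832 fence_dvs42832 -inf: dvs42832
    crm configure location l_fence_dvs47713 fence_dvs47713 -inf: dvs47713
    crm configure property stonith-enabled=true

For problem 2, if you still want the cluster to tolerate short network blips, the knob is the corosync token timeout rather than any Pacemaker resource timeout, e.g. in /etc/corosync/corosync.conf (the value is an example; defaults differ between corosync versions):

    totem {
        # Milliseconds the token may be lost before corosync
        # declares a membership change (default is much lower):
        token: 5000
    }

But note: without fencing a longer token only delays the split brain, it does not prevent it.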
Regards,
Ulrich

>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 18:24 in message <pr2p264mb076785e16fad8f054d972c0ef5...@pr2p264mb0767.frap264.prod.outlook.com>:
> Here is an extract of "corosync.log"...
>
> Thierry
>
> ________________________________
> From: Users <users-boun...@clusterlabs.org> on behalf of Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
> Sent: Wednesday, March 9, 2022 17:13
> To: users@clusterlabs.org <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Cluster timeout
>
>>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 16:56 in message <pr2p264mb07678dbb0517cb8c7695627cf5...@pr2p264mb0767.frap264.prod.outlook.com>:
>
>>>>> FLORAC Thierry <thierry.flo...@onf.fr> wrote on 09.03.2022 at 11:46 in message <pr2p264mb076751671fc57f33b995f851f5...@pr2p264mb0767.frap264.prod.outlook.com>:
>>
>>> Hi,
>>>
>>> I manage an active/passive PostgreSQL cluster using DRBD, LVM, Pacemaker and Corosync on a Debian GNU/Linux operating system.
>>> Everything is OK, but my platform seems to be quite "sensitive" to small network timeouts which are generating a cluster migration start from active to passive node; generally, the process doesn't go through to the end: as soon as the connection is back again, the migration is cancelled and the database restarts!
>>
>> Could it be you run without fencing? Maybe show some logs!
>>
>> Logs are quite verbose and not very easy to understand...
>> What log would you need?
>
> Those showing what happens when the network goes down, and what happens when the network comes up.
> Usually the DC writes some good "action summaries" (typically after "pacemaker-controld[7236]: notice: State transition S_IDLE -> S_POLICY_ENGINE"). Those would be helpful.
>
>>> That should be OK but on the application side, some database connections (on a Java WildFly server) can become "invalid"! So I would like to avoid these migrations when this kind of small timeout occurs...
>>>
>>> So my question is: which cluster settings can I change to increase the timeout before starting a cluster migration?
>>>
>>> Best regards,
>>> Thierry
>>>
>>> Thierry Florac
>>> Resp. Pôle Architecture Applicative et Mobile
>>> DSI ‑ Dépt. Études et Solutions Transverses
>>> 2, avenue de Saint‑Mandé ‑ 75570 Paris cedex 12
>>> Tél : 01 40 19 59 64
>>> www.onf.fr <https://www.onf.fr>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/