Sorry, sometimes I try to simplify and may leave out information. I think I found what happened, and actually xstha2 DID NOT import the zpool, nor start the IP address.

Let me take a step back. We have two IP resources: one normally runs on xstha1, the other on xstha2. The zpool normally runs on xstha1. The two IPs are used to share NFS resources to a Proxmox cluster (that's why I call them NFS IPs). The logic moves the zpool and then the IP to node xstha2 when xstha1 is not available, and vice versa.

I was confused by the "duplicated IP" message I saw on xstha1 while xstha2 was about to be stonished. I was worried that xstha2 might have imported the zpool while xstha1 still had it imported. Reading the "duplicated IP" message again, it was actually xstha1 that (having the pool imported and no longer seeing xstha2) took over the xstha2 NFS IP. So I think there is no worry about the zpool!

I will now try to play with "no-quorum-policy=ignore" to see whether that actually works correctly with stonith enabled.

Thanks for your help!

Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
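For the archives, a minimal sketch of the properties involved, assuming the crm shell is available (pcs users would use `pcs property set` instead); the property names are real pacemaker cluster options, but verify them against your pacemaker version:

```shell
# Let the surviving node keep running resources when quorum is lost.
# Only safe on a two-node cluster if stonith is working, since fencing
# is then the only protection against both nodes acting at once.
crm configure property no-quorum-policy=ignore

# Keep fencing enabled so the lost node is powered off before takeover:
crm configure property stonith-enabled=true

# Verify the resulting cluster properties:
crm configure show | grep -E 'no-quorum-policy|stonith-enabled'
```

On corosync 2.x and later, setting `two_node: 1` in corosync.conf is the more common way to handle quorum on two-node clusters, but the property above is the pacemaker-level equivalent being discussed here.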
----------------------------------------------------------------------------------
From: Andrei Borzenkov <arvidj...@gmail.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Date: 17 December 2020, 9:50:54 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

On Thu, Dec 17, 2020 at 11:11 AM Gabriele Bulfon <gbul...@sonicle.com> wrote:
>
> Yes, sorry took same bash by mistake...here are the correct logs.
>
> Yes, xstha1 has delay 10s so that I'm giving him precedence, xstha2 has
> delay 1s and will be stonished earlier.
> During the short time before xstha2 got powered off, I saw it had time to
> turn on NFS IP (I saw duplicated IP on xstha1).

Again - please write so that others can understand you. How should we
know what "NFS IP" is supposed to be? You have two resources that look
like they are IP related and neither of them has NFS in its name:
xstha1_san0_IP, xstha2_san0_IP. And even if they had NFS in their names -
which of the two resources are you talking about?

According to the logs from xstha1, it started to activate resources only
after stonith was confirmed:

Dec 16 15:08:12 [708] stonith-ng: notice: log_operation: Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2' with device 'xstha2-stonith' returned: 0 (OK)
Dec 16 15:08:12 [708] stonith-ng: notice: remote_op_done: Operation 'off' targeting xstha2 on xstha1 for crmd.712@xstha1.e487e7cc: OK

It is possible that your IPMI/BMC/whatever implementation responds with
success before it actually completes this action. I have seen at least
some delays in the past. There is not really much that can be done here
except adding an artificial delay to the stonith resource agent. You
need to test IPMI functionality before using it in pacemaker.

In this case xstha1 may have configured the xstha2_san0_IP resource
before xstha2 was down. This would explain the duplicated IP.
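A sketch of the manual test being suggested, assuming ipmitool and a LAN-reachable BMC; the address and credentials below are placeholders, not values from this cluster:

```shell
# Hypothetical sketch: exercise the fencing path by hand before trusting
# it in pacemaker, and measure how long the node really takes to lose
# power after the BMC reports success.
BMC=192.0.2.22; USER=admin; PASS=secret   # placeholders

ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power off

# Poll chassis status until it actually reports "off"; the gap between
# the command returning and this loop ending is the delay the stonith
# agent would need to absorb.
while ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" \
        chassis power status | grep -q on; do
    sleep 1
done
```

If this loop runs for several seconds after `power off` returns 0, the BMC is confirming the operation before completing it, which matches the race described above.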
> And because the configuration has "order zpool_data_order inf: zpool_data
> ( xstha1_san0_IP )", that means xstha2 had imported the zpool for a short
> time before being stonished, and this must never happen.

There is no indication in the logs that pacemaker started or attempted
to start either xstha1_san0_IP or zpool_data on xstha2.

>
> What suggests to me that resources were started on xstha2 (and the
> duplicated IP is an effect) are these log portions from xstha2.

The resource xstha2_san0_IP *remained* started on xstha2. Pacemaker did
not try to stop it at all; it had no reason to do so.

> These tell me it could not turn off resources on xstha1 (correct, it
> couldn't contact xstha1):
>
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action zpool_data_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
>
> These tell me xstha2 took control of resources that were actually running
> on xstha1:
>
> Dec 16 15:08:56 [667] pengine: notice: LogAction: * Move xstha1_san0_IP ( xstha1 -> xstha2 )
> Dec 16 15:08:56 [667] pengine: info: LogActions: Leave xstha2_san0_IP (Started xstha2)
> Dec 16 15:08:56 [667] pengine: notice: LogAction: * Move zpool_data ( xstha1 -> xstha2 )
> Dec 16 15:08:56 [667] pengine: info: LogActions: Leave xstha1-stonith (Started xstha2)
> Dec 16 15:08:56 [667] pengine: notice: LogAction: * Stop xstha2-stonith ( xstha1 ) due to node availability

These lines are only the action plan - what pacemaker *will* do.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
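The distinction between the scheduler's plan and what actually ran can be checked with crm_simulate; a sketch, assuming the pengine input files are kept in the default location (the file name below is a placeholder, the real one is referenced in the pengine log lines for that transition):

```shell
# Replay a saved scheduler input offline and print the transition it
# would compute - the same Move/Stop/Leave lines seen in the log,
# without executing anything.
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-42.bz2

# Or compute the planned transition against the live CIB:
crm_simulate -L -S
```

Comparing this plan with the lrmd/crmd "executing operation" log entries on each node shows which planned actions were actually carried out before the transition was aborted.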