[ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

2021-03-02 Thread Ulrich Windl
>>> Eric Robinson  wrote on 02.03.2021 at 19:26 in message:

>>  -Original Message-
>> From: Users  On Behalf Of Digimer
>> Sent: Monday, March 1, 2021 11:02 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> ; Ulrich Windl 
>> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
...
>> >> Cloud fencing usually requires a higher timeout (20s reported here).
>> >>
>> >> Microsoft seems to suggest the following setup:
>> >>
>> >> # pcs property set stonith-timeout=900
>> >
>> > But doesn't that mean the other node waits 15 minutes after stonith
>> > until it performs the first post-stonith action?
>>
>> No, it means that if there is no reply by then, the fence has failed. If the
>> fence happens sooner, and the caller is told this, recovery begins very
>> shortly after.

How would the fencing be confirmed? I don't know.


>>
> 
> Interesting. Since users often report application failure within 1-3 minutes
> and many engineers begin investigating immediately, a technician could end up
> connecting to a cluster node after the stonith command was called, and could
> conceivably bring a failed node back up manually, only to have Azure finally
> get around to shooting it in the head. I don't suppose there's a way to
> abort/cancel a STONITH operation that is in progress?

I think you have to decide: let the cluster handle the problem, or let the
admin handle the problem, but preferably not both.
I also think you cannot cancel a STONITH; you can only confirm it.
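
(For reference, a minimal sketch of what "confirming" a fence looks like, using the node name from this thread. Both commands tell the cluster that the node is already safely down, so they must only be used once you are certain it really is off; exact pcs syntax varies by version, and newer pcs releases additionally require --force:)

# stonith_admin --confirm 001db02a
# pcs stonith confirm 001db02a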

Regards,
Ulrich

...


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-03-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Jan Friesse
> Sent: Monday, March 1, 2021 3:27 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Andrei Borzenkov 
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
>
> > On 27.02.2021 22:12, Andrei Borzenkov wrote:
> >> On 27.02.2021 17:08, Eric Robinson wrote:
> >>>
> >>> I agree, one node is expected to go out of quorum. Still the question is,
> >>> why didn't 001db01b take over the services? I just remembered that
> >>> 001db01b has services running on it, and those services did not stop, so it
> >>> seems that 001db01b did not lose quorum. So why didn't it take over the
> >>> services that were running on 001db01a?
> >>
> >> That I cannot answer. I cannot reproduce it using similar configuration.
> >
> > Hmm ... actually I can.
> >
> > Two nodes ha1 and ha2 + qdevice. I blocked all communication *from*
> > ha1 (to be precise - all packets with ha1 source MAC are dropped).
> > This happened around 10:43:45. Now look:
> >
> > ha1 immediately stops all services:
> >
> > Feb 28 10:43:44 ha1 corosync[3692]:   [TOTEM ] A processor failed, forming new configuration.
> > Feb 28 10:43:47 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2944) was formed. Members left: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] Failed to receive the leave message. failed: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [CPG   ] downlist left_list: 1 received
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Removing all ha2 attributes for peer loss
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
> > Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
> > Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:48 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:48 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2948) was formed. Members
> > Feb 28 10:43:48 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:50 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:50 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2952) was formed. Members
> > Feb 28 10:43:50 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:51 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:51 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2956) was formed. Members
> > Feb 28 10:43:51 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo reply message on time
> > Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb 28 10:43:56 error Server didn't send echo reply message on time
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] Members[1]: 1
> > Feb 28 10:43:56 ha1 corosync[3692]:   [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Quorum lost
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Updating quorum status to false (call=274)
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  warning: Fencing and resource management disabled due to lack of quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop p_drbd0:0(Master ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop p_drbd1:0( Slave ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop p_fs_clust01 (   ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 
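
(For reference, the MAC-level isolation Andrei describes above can be reproduced with something along these lines, run on ha2 and on the qnetd host rather than on ha1 itself; the MAC address 52:54:00:aa:bb:01 is only a placeholder for ha1's real address:)

# iptables -A INPUT -m mac --mac-source 52:54:00:aa:bb:01 -j DROP
# ip6tables -A INPUT -m mac --mac-source 52:54:00:aa:bb:01 -j DROP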

Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

2021-03-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Digimer
> Sent: Monday, March 1, 2021 11:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Ulrich Windl 
> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
> but It got fenced anyway
>
> On 2021-03-01 2:50 a.m., Ulrich Windl wrote:
>  Valentin Vidic  wrote on 28.02.2021 at 16:59
>  in message <20210228155921.gm29...@valentin-vidic.from.hr>:
> >> On Sun, Feb 28, 2021 at 03:34:20PM +, Eric Robinson wrote:
> >>> 001db02b rebooted. After it came back up, I tried it in the other
> >>> direction.
> >>>
> >>> On node 001db02b, the command...
> >>>
> >>> # pcs stonith fence 001db02a
> >>>
> >>> ...produced output...
> >>>
> >>> Error: unable to fence '001db02a'.
> >>>
> >>> However, node 001db02a did get restarted!
> >>>
> >>> We also saw this error...
> >>>
> >>> Failed Actions:
> >>> * stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70,
> >>>   status=Timed Out, exitreason='',
> >>>   last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms
> >>>
> >>> When that happens, does Pacemaker take over the other node's
> >>> resources, or not?
> >>
> >> Cloud fencing usually requires a higher timeout (20s reported here).
> >>
> >> Microsoft seems to suggest the following setup:
> >>
> >> # pcs property set stonith-timeout=900
> >
> > But doesn't that mean the other node waits 15 minutes after stonith
> > until it performs the first post-stonith action?
>
> No, it means that if there is no reply by then, the fence has failed. If the
> fence happens sooner, and the caller is told this, recovery begins very
> shortly after.
>

Interesting. Since users often report application failure within 1-3 minutes
and many engineers begin investigating immediately, a technician could end up
connecting to a cluster node after the stonith command was called, and could
conceivably bring a failed node back up manually, only to have Azure finally
get around to shooting it in the head. I don't suppose there's a way to
abort/cancel a STONITH operation that is in progress?
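
(Assuming the answer given elsewhere in this thread holds, a fence action cannot be cancelled, only confirmed; the closest thing is inspecting what the fencer has done or is still doing. A sketch, assuming Pacemaker 2.x tooling and a pcs release that has "pcs stonith history":)

# stonith_admin --history 001db02a
# pcs stonith history show 001db02a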

> >> # pcs stonith create rsc_st_azure fence_azure_arm username="login ID"
> >>   password="password" resourceGroup="resource group" tenantId="tenant ID"
> >>   subscriptionId="subscription id"
> >>   pcmk_host_map="prod-cl1-0:prod-cl1-0-vm-name;prod-cl1-1:prod-cl1-1-vm-name"
> >>   power_timeout=240 pcmk_reboot_timeout=900 pcmk_monitor_timeout=120
> >>   pcmk_monitor_retries=4 pcmk_action_limit=3
> >>   op monitor interval=3600
> >>
> >> https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/high-availability-guide-rhel-pacemaker
> >>
> >> ‑‑
> >> Valentin
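
(To double-check what a setup like the one quoted above ends up with, something along these lines should show the configured cluster-wide timeout and the fence device; newer pcs releases spell the second command "pcs stonith config", older ones "pcs stonith show --full":)

# pcs property show stonith-timeout
# pcs stonith config rsc_st_azure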
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/ "I am, somehow, less
> interested in the weight and convolutions of Einstein’s brain than in the near
> certainty that people of equal talent have lived and died in cotton fields and
> sweatshops." - Stephen Jay Gould
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] "Error: unable to fence '001db02a'" but It got fenced anyway

2021-03-02 Thread Klaus Wenninger
On 3/1/21 8:48 AM, Ulrich Windl wrote:
>  Eric Robinson  wrote on 28.02.2021 at 16:34 in message:
>
>> I just configured STONITH in Azure for the first time. My initial test went
>> fine.
>>
>> On node 001db02a, the command...
>>
>> # pcs stonith fence 001db02b
>>
>> ...produced output...
>>
>> 001db02b fenced.
>>
>> 001db02b rebooted. After it came back up, I tried it in the other
>> direction.
>> On node 001db02b, the command...
>>
>> # pcs stonith fence 001db02a
>>
>> ...produced output...
>>
>> Error: unable to fence '001db02a'.
>>
>> However, node 001db02a did get restarted!
>>
>> We also saw this error...
>>
>> Failed Actions:
>> * stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70,
>> status=Timed Out, exitreason='',
>> last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms
>>
>> When that happens, does Pacemaker take over the other node's resources, or 
>> not?
> In case you are debugging an "unsuccessful" stonith that actually reboots the
> node, you might try something like "journalctl -f" on the node to be fenced,
> just in case the last messages are not written to disk.
> Generally no stonith can send a confirmation response _after_ stonith was
> successful, so usually it works with timeouts (assuming the node is down n
> seconds after issuing stonith to it).
Actually this is only true for cases where you don't have
a classical stonith device (SBD, ssh fencing, ...), and
you have to be careful in those cases.
A stonith device like IPMI, or a power plug with an IP connection,
will give feedback that it has successfully turned off the power.
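
(For illustration, that positive feedback is essentially a power-status query against the device; a sketch using fence_ipmilan, with a made-up BMC address and credentials:)

# fence_ipmilan --ip=10.0.0.10 --username=admin --password=secret --lanplus --action=status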

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: VirtualDomain RA

2021-03-02 Thread Roger Zhou



On 3/1/21 7:17 PM, Ulrich Windl wrote:

Hi!

I have a question about the VirtualDomain RA (as in SLES15 SP2):
Why does the RA "undefine", then "create" a domain instead of just "start"ing a
domain?
I mean: assuming that an "installation" does "define" the domains, why bother
with configuration files and "create" when a simple "start" would also do?



It makes sense to clean up certain situations to avoid a "create" failure.
Example:

1. given that foobar.xml doesn't provide a UUID
2. `virsh define foobar.xml`
3. `virsh create foobar.xml`  <-- error: Failed to create domain from foobar.xml
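
(A quick way to inspect the state that leads to this, using foobar only as the example name from the steps above; presumably the name collision between the already-defined domain and the freshly generated UUID in step 3 is what "create" trips over:)

# virsh list --all --name
# virsh dominfo foobar | grep -i persistent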

Cheers,
Roger



Specifically this code:
verify_undefined() {
    local tmpf
    # is a domain with this name already known to libvirt (defined or running)?
    if virsh --connect=${OCF_RESKEY_hypervisor} list --all --name 2>/dev/null | grep -wqs "$DOMAIN_NAME"
    then
        tmpf=$(mktemp -t vmcfgsave.XXXXXX)
        if [ ! -r "$tmpf" ]; then
            ocf_log warn "unable to create temp file, disk full?"
            # we must undefine the domain
            virsh $VIRSH_OPTIONS undefine $DOMAIN_NAME > /dev/null 2>&1
        else
            # undefining can remove the config file, so save it and restore it afterwards
            cp -p $OCF_RESKEY_config $tmpf
            virsh $VIRSH_OPTIONS undefine $DOMAIN_NAME > /dev/null 2>&1
            [ -f $OCF_RESKEY_config ] || cp -f $tmpf $OCF_RESKEY_config
            rm -f $tmpf
        fi
    fi
}

Regards,
Ulrich





___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/