[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-22 Thread Lentes, Bernd
Hi guys,

Occasionally, stopping a VirtualDomain resource via "crm resource stop" does not
work, and in the end the node is fenced, which is ugly.
I had a look at the RA to see what it does. It first tries to stop the domain via
"virsh shutdown ..." and, if that does not succeed within a configurable time, it
switches to "virsh destroy".
I assume "virsh destroy" sends a SIGKILL to the respective process. But when the
host is doing heavy I/O, the process may be in "D" state
(uninterruptible sleep), in which it can't be terminated with a SIGKILL. The node
the domain is running on is then fenced because of that.
I dug deeper and found out that the signal is often delivered a bit later
(just a few seconds) and the process is killed, but by then Pacemaker has already
decided to fence the node.
It's all about this excerpt from the RA:

force_stop()
{
        local out ex translate
        local status=0

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        : ;; # unexpected path to the intended outcome, all is well
                [!0]*)
                        ocf_exit_reason "forced stop failed"
                        return $OCF_ERR_GENERIC ;;
                0*)
                        while [ $status != $OCF_NOT_RUNNING ]; do
                                VirtualDomain_status
                                status=$?
                        done ;;
        esac
        return $OCF_SUCCESS
}

I'm thinking about the following:
How about letting the script wait a bit after "virsh destroy"? I saw that
it usually takes just a few seconds until "virsh destroy" is successful.
I'm thinking about this change:

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        sleep 10    # <-- new (or maybe configurable)
        translate=$(echo $out|tr 'A-Z' 'a-z')


What do you think?

Bernd


-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

stay healthy
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-22 Thread Andrei Borzenkov
22.10.2020 23:29, Lentes, Bernd wrote:
> I'm thinking about the following:
> How about letting the script wait a bit after "virsh destroy"? I saw that
> it usually takes just a few seconds until "virsh destroy" is successful.
> I'm thinking about this change:
>
>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>         ex=$?
>         sleep 10    # <-- new (or maybe configurable)
>         translate=$(echo $out|tr 'A-Z' 'a-z')
>
>
> What do you think?
>


It makes no difference. You wait 10 seconds before parsing output of
"virsh destroy", that's all. It does not change output itself, so if
output indicates that "virsh destroy" failed, it will still indicate
that after 10 seconds.

Either you need to repeat "virsh destroy" in a loop, or virsh itself
should be more robust.
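
A minimal sketch of the first option, repeating the destroy request until it
succeeds, the domain is reported as already gone, or a retry budget is used up.
The helper name destroy_with_retries, the retry count and the pause are
illustrative assumptions and not part of the shipped RA; DOMAIN_NAME and
VIRSH_OPTIONS are the variables the RA excerpt already uses, and the three
"already gone" messages are the ones from its case statement.

```shell
# Hypothetical helper, not part of the VirtualDomain RA.
# Usage: destroy_with_retries [max_tries] [pause_seconds]
destroy_with_retries() {
    local tries=${1:-6} pause=${2:-5}
    local out ex i=1

    while [ "$i" -le "$tries" ]; do
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy "${DOMAIN_NAME}" 2>&1)
        ex=$?
        out=$(echo "$out" | tr 'A-Z' 'a-z')
        case $ex$out in
            0*)
                return 0 ;;   # destroy succeeded
            *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
            *"error:"*"failed to get domain"*)
                return 0 ;;   # domain is already gone
        esac
        # Pause only if another attempt follows.
        [ "$i" -lt "$tries" ] && sleep "$pause"
        i=$((i + 1))
    done
    return 1                  # still failing after all attempts
}
```

Whether a repeated destroy helps while the process is still in D state is not
confirmed in this thread; the total of tries times pause also has to fit inside
the stop operation's timeout.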


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


On Oct 23, 2020, at 7:11 AM, Andrei Borzenkov arvidj...@gmail.com wrote:


>>
>>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>>         ex=$?
>>         sleep 10    # <-- new (or maybe configurable)
>>         translate=$(echo $out|tr 'A-Z' 'a-z')
>>
>>
>> What do you think?
>>
> 
> 
> It makes no difference. You wait 10 seconds before parsing output of
> "virsh destroy", that's all. It does not change output itself, so if
> output indicates that "virsh destroy" failed, it will still indicate
> that after 10 seconds.
> 
> Either you need to repeat "virsh destroy" in a loop, or virsh itself
> should be more robust.

Hi Andrei,

Yes, you are right. I noticed it right after sending the e-mail.
I will change that.

Bernd


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


On Oct 23, 2020, at 5:06 PM, Strahil Nikolov hunter86...@yahoo.com wrote:

> Why don't you work with something like this: 'op stop interval=300
> timeout=600'?
> The stop operation will then time out according to your requirements without
> modifying the script.
> 
> Best Regards,
> Strahil Nikolov

But when the timeout has run out, the RA tries to kill the machine with a "virsh
destroy".
And if that does not work (which is occasionally my problem) because the domain
is in uninterruptible sleep (D state), the RA returns $OCF_ERR_GENERIC, which
causes Pacemaker to fence the lazy node. Or am I wrong?
Where is the benefit of the shorter interval?

The return value of the "virsh destroy" operation is set immediately,
and it's non-zero when "virsh destroy" didn't succeed.
Even if the domain stops 20 seconds later, the return value does not change;
it is sent to the LRM, so the cluster wants to STONITH that node.

Surprisingly, if "virsh destroy" is successful, the RA waits until the domain
isn't running anymore:

force_stop()
{
        ...

                0*)
                        while [ $status != $OCF_NOT_RUNNING ]; do
                                VirtualDomain_status
                                status=$?
                        done ;;

I need something like that which waits for some time (maybe 30 s) in case the
domain stops after all, even though "virsh destroy" returned an error, because
the SIGKILL is delivered once the process wakes up from D state.
The RA has to wait for that amount of time and make sure that the return
value is zero if the domain stopped, or non-zero if waiting didn't help either.

Bernd


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Strahil Nikolov
Why don't you work with something like this: 'op stop interval=300
timeout=600'?
The stop operation will then time out according to your requirements without
modifying the script.

Best Regards,
Strahil Nikolov


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Andrei Borzenkov
23.10.2020 21:08, Lentes, Bernd wrote:
> 
> Surprisingly, if "virsh destroy" is successful, the RA waits until the
> domain isn't running anymore:
>
...
>
> I need something like that which waits for some time (maybe 30 s) in case the
> domain stops after all, even though "virsh destroy" returned an error, because
> the SIGKILL is delivered once the process wakes up from D state.

So why not ignore the virsh error and just always wait? You probably still
need to retain the "domain not found" exit condition.
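
A rough sketch of how that suggestion could look inside force_stop, assuming
the RA's existing helpers (ocf_log, ocf_exit_reason, VirtualDomain_status) and
the OCF return codes are in scope. The FORCE_STOP_WAIT variable and its
30-second default are made up for illustration and are not an existing RA
parameter.

```shell
# Sketch only: FORCE_STOP_WAIT is a hypothetical knob, not an RA parameter.
force_stop()
{
        local out ex translate
        local status=0
        local remaining=${FORCE_STOP_WAIT:-30}

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        return $OCF_SUCCESS ;; # already gone, nothing to wait for
        esac

        # Any other result, success or error: poll until the domain is gone
        # or the time budget is used up.
        while [ "$remaining" -gt 0 ]; do
                VirtualDomain_status
                status=$?
                [ $status -eq $OCF_NOT_RUNNING ] && return $OCF_SUCCESS
                sleep 1
                remaining=$((remaining - 1))
        done
        ocf_exit_reason "forced stop failed"
        return $OCF_ERR_GENERIC
}
```

Unlike the original excerpt, this variant also bounds the wait after a
successful destroy. Any wait added here has to fit inside the stop operation's
timeout, otherwise the operation times out regardless of what the RA returns.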


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvi...@valentin-vidic.from.hr wrote:

> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>> But when the timeout has run out, the RA tries to kill the machine with a
>> "virsh destroy".
>> And if that does not work (which is occasionally my problem) because the
>> domain is in uninterruptible sleep (D state), the RA returns
>> $OCF_ERR_GENERIC, which causes Pacemaker to fence the lazy node. Or am I
>> wrong?
> 
> What does the log look like when this happens?
> 

/var/log/cluster/corosync.log:

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
Sep 27 22:37:11 [11282] ha-idg-2   lrmd:  warning: child_timeout_callback:  vm_amok_stop_0 process (PID 8998) timed out
Sep 27 22:37:11 [11282] ha-idg-2   lrmd:  warning: operation_finished:  vm_amok_stop_0:8998 - timed out after 18ms
  timeout of the domain is 180 sec.

/var/log/libvirt/libvirtd.log (time is UTC):

2020-09-27 20:37:21.489+: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy
2020-09-27 20:37:21.505+: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
2020-09-27 20:37:31.962+: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor

SIGKILL didn't work. Nevertheless, the process ended about 20 seconds after the
destroy, surely because it woke up from D state and received the signal.

/var/log/cluster/corosync.log on the DC:

Sep 27 22:37:11 [3580] ha-idg-1   crmd:  warning: status_from_rc:   Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
  Stop (also sigkill) failed
Sep 27 22:37:11 [3579] ha-idg-1 pengine:   notice: native_stop_constraints:  Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
  cluster decides to fence the node although resource is stopped 10 seconds later

atop log:
14037  - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
  PID of the domain is 14037

14037  - E   0% worker   (at 22:37:31)
  domain has stopped
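
To check whether the qemu process really sits in D state while "virsh destroy"
fails, its state letter can be read from /proc. A small sketch, Linux-only; the
helper name proc_state is made up for illustration.

```shell
# Hypothetical helper: print the one-letter state (R, S, D, Z, ...) of a PID,
# taken from the "State:" line of /proc/<pid>/status (Linux only).
proc_state() {
    local pid=$1 key val rest

    [ -r "/proc/$pid/status" ] || return 1   # no such process (or no /proc)
    while read -r key val rest; do
        if [ "$key" = "State:" ]; then
            echo "$val"
            return 0
        fi
    done < "/proc/$pid/status"
    return 1
}
```

Called with the qemu PID from the atop output (proc_state 14037), it would print
D for as long as the process is in uninterruptible sleep, and return non-zero
once the process is gone.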


Bernd


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote:


>> I need something like that which waits for some time (maybe 30 s) in case the
>> domain stops after all, even though "virsh destroy" returned an error, because
>> the SIGKILL is delivered once the process wakes up from D state.
>
> So why not ignore the virsh error and just always wait? You probably still
> need to retain the "domain not found" exit condition.

That's my plan.

Bernd


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Valentin Vidić
On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
> But when the timeout has run out, the RA tries to kill the machine with a
> "virsh destroy".
> And if that does not work (which is occasionally my problem) because the domain
> is in uninterruptible sleep (D state), the RA returns $OCF_ERR_GENERIC, which
> causes Pacemaker to fence the lazy node. Or am I wrong?

What does the log look like when this happens?

-- 
Valentin


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-26 Thread Lentes, Bernd


On Oct 23, 2020, at 11:18 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:

> On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote:
> 
> 
>>> I need something like that which waits for some time (maybe 30 s) in case the
>>> domain stops after all, even though "virsh destroy" returned an error, because
>>> the SIGKILL is delivered once the process wakes up from D state.
>>
>> So why not ignore the virsh error and just always wait? You probably still
>> need to retain the "domain not found" exit condition.

Hi,

here is my rewritten RA: 
https://hmgubox2.helmholtz-muenchen.de/index.php/s/iYjRyJiWb5XNfXm

As I'm not a coder, feedback is welcome.

Bernd