[ClusterLabs] Antw: [EXT] Re: Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Ulrich Windl
>>> Ken Gaillot  schrieb am 16.12.2020 um 20:13 in 
>>> Nachricht
:

...
>> So the cluster is exactly doing the wrong thing: The VM is still
>> active on h16, while a "recovery" on h19 will start it there! So
>> _after_ the recovery the VM is duplicated.
> 
> The problem here is that a stop should be scheduled on both nodes, not
> just one of them. Then the start is scheduled on only one node.
> 
> Do you have the pe input from this transition?

I hope the one attached is the right one ;-)

Regards,
Ulrich




pe-error-5.bz2
Description: Binary data
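For reference, a pe input like the attached one can be replayed offline to see
what the scheduler decided (a minimal sketch, assuming the attachment is saved
locally as pe-error-5.bz2):

  # replay the transition and show the scheduled actions plus allocation scores
  crm_simulate --simulate --show-scores --xml-file pe-error-5.bz2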
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: crm resource start/stop completion

2020-12-16 Thread Ulrich Windl
Hi!

Thanks, seems I have to wait until it comes out from the SUSE distillery ;-)

Regards,
Ulrich

>>> Xin Liang  schrieb am 16.12.2020 um 18:09 in Nachricht


> Hi Ulrich,
> 
> In https://github.com/ClusterLabs/crmsh/pull/273/files, the 4.2.0 definitely

> already included this change.
> 
> In my clean environment(0 resources configured cluster):
> 
>   1.  crm configure primitive d Dummy   (then resource d will be started)
>   2.  crm resource stop (or  after stop, then you
will 
> find d completed)
>   3.  crm resource start (or  after start, then
you got d 
> completed)
> 
> 
> From: Users  on behalf of Ulrich Windl 
> 
> Sent: Wednesday, December 16, 2020 5:57 PM
> To: users@clusterlabs.org 
> Subject: [ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop
completion
> 
 Xin Liang  schrieb am 15.12.2020 um 03:55 in Nachricht
>
om>
> 
>> Hi Ulrich,
>>
>> crmsh already could do this completion:)
> 
> Hi!
> 
> Since when? Mine (crmsh-4.2.0+git.1604052559.2a348644-5.26.1.noarch) seems
> unable: When I use "crm(live/h16)resource# start " and press TAB, I see
> resources that are running already. Maybe that's because they don't have a 
> target-role
> (so they are started by default).
> 
> Can you confirm?
> 
> Regards,
> Ulrich
> 
>>
>> Regards,
>> Xin
>> 
>> From: Users  on behalf of Ulrich Windl
>> 
>> Sent: Monday, December 14, 2020 6:51 PM
>> To: users@clusterlabs.org 
>> Subject: [ClusterLabs] Q: crm resource start/stop completion
>>
>> Hi!
>>
>> I wonder: Would it be difficult and would it make sense to change crm
shell
> 
>> to:
>> Complete "resource start " with only those resources that aren't running
>> (per role) already
>> Complete "resource stopt " with only those resources that are running (per
>> role)
>>
>> I came to this after seeing "3 disabled resources" in crm_mon output. As
>> crm_mon itself cannot display just those disabled resources (yet? ;‑)), I
> used
>> crm shell...
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: crm shell: "params param"?

2020-12-16 Thread Ulrich Windl
>>> Xin Liang  schrieb am 16.12.2020 um 17:49 in Nachricht


> Hi Ulrich,
> 
>> params param config="/etc/libvirt/libxl/test‑jeos.xml
> 
> It seems it should raise an error while parsing input from the command line, since 
> "param" does not exist as an RA parameter

Indeed, I found out that "param" (most likely caused by some copy error)
is a parameter named "param" that has no value ;-)
So that's "params param"...
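For comparison, a cleaned-up version of that line would simply drop the stray
token (a sketch based on the primitive quoted further down, not a tested
config):

  primitive prm_xen_test-jeos VirtualDomain \
          params config="/etc/libvirt/libxl/test-jeos.xml" hypervisor="xen:///system" ...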

> 
> Thanks for the report!
> 
> Regards,
> Xin
> 
> From: Users  on behalf of Ulrich Windl 
> 
> Sent: Wednesday, December 16, 2020 10:23 PM
> To: users@clusterlabs.org ; Roger Zhou

> Subject: [ClusterLabs] crm shell: "params param"?
> 
 Ulrich Windl schrieb am 16.12.2020 um 14:53 in Nachricht <5FDA1167.92C :
161 
> :
> 60728>:
> 
> [...]
>> primitive prm_xen_test‑jeos VirtualDomain \
>> params param config="/etc/libvirt/libxl/test‑jeos.xml"
> 
> BTW: Is "params param" a bug in crm shell? In older crm shell I only see 
> "params"...
> 
>> hypervisor="xen:///system" autoset_utilization_cpu=false
>> autoset_utilization_hv_memory=false \
>> op start timeout=120 interval=0 \
>> op stop timeout=180 interval=0 \
>> op monitor interval=600 timeout=90 \
>> op migrate_to timeout=300 interval=0 \
>> op migrate_from timeout=300 interval=0 \
>> utilization utl_cpu=20 utl_ram=2048 \
>> meta priority=123 allow‑migrate=true
>>
> [...]
> 
> Regards,
> Ulrich
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-16 Thread Ulrich Windl
>>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:56 in
Nachricht <386755316.773.1608130588146@www>:
> Thanks, here are the logs, there are infos about how it tried to start 
> resources on the nodes.
> Keep in mind the node1 was already running the resources, and I simulated a 
> problem by turning down the ha interface.

Please note that "turning down" an interface is NOT a realistic test; a realistic 
test would be to unplug the cable.

>  
> Gabriele
>  
>  
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>  
> 
> 
> 
> 
> 
> --
> 
> Da: Ulrich Windl 
> A: users@clusterlabs.org 
> Data: 16 dicembre 2020 15.45.36 CET
> Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource
> 
> 
 Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 in
> Nachricht <1523391015.734.1608129155836@www>:
>> Hi, I have now a two node cluster using stonith with different 
>> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
>> problems.
>> 
>> Though, there is still one problem: once node 2 delays its stonith action 
>> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> 
>> so it happens that while it's not yet powered off by node 1 (and waiting its 
> 
> delay to power off node 1) it actually starts resources, causing a moment of 
> 
> a few seconds where both the NFS IP and the ZFS pool (!) are mounted by both!
> 
> AFAIK pacemaker will not start resources on a node that is scheduled for 
> stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
> for stonith to start them elsewhere.
> 
>> How can I delay node 2 resource start until the delayed stonith action is 
>> done? Or how can I just delay the resource start so I can make it larger 
> than 
>> its pcmk_delay_base?
> 
> We probably need to see logs and configs to understand.
> 
>> 
>> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
>> to set this flag (cib-bootstrap-options is not happy with it...).
> 
> I think it's on by default, so you must have set it to false.
> In crm shell it is "configure# property stonith-enabled=...".
> 
> Regards,
> Ulrich
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop completion

2020-12-16 Thread Xin Liang
Hi Ulrich,

Sorry, I confirmed that the completion might not be correct for resource groups.
I already raised a related bug.

Thank you!

Regards,
Xin


From: Users  on behalf of Ulrich Windl 

Sent: Wednesday, December 16, 2020 5:57 PM
To: users@clusterlabs.org 
Subject: [ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop completion

>>> Xin Liang  schrieb am 15.12.2020 um 03:55 in Nachricht


> Hi Ulrich,
>
> crmsh already could do this completion:)

Hi!

Since when? Mine (crmsh-4.2.0+git.1604052559.2a348644-5.26.1.noarch) seems
unable: When I use "crm(live/h16)resource# start " and press TAB, I see
resources that are running already. Maybe that's because they don't have a target-role
(so they are started by default).

Can you confirm?

Regards,
Ulrich

>
> Regards,
> Xin
> 
> From: Users  on behalf of Ulrich Windl
> 
> Sent: Monday, December 14, 2020 6:51 PM
> To: users@clusterlabs.org 
> Subject: [ClusterLabs] Q: crm resource start/stop completion
>
> Hi!
>
> I wonder: Would it be difficult and would it make sense to change crm shell

> to:
> Complete "resource start " with only those resources that aren't running
> (per role) already
> Complete "resource stopt " with only those resources that are running (per
> role)
>
> I came to this after seeing "3 disabled resources" in crm_mon output. As
> crm_mon itself cannot display just those disabled resources (yet? ;‑)), I
used
> crm shell...
>
> Regards,
> Ulrich
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Andrei Borzenkov
16.12.2020 19:05, Gabriele Bulfon пишет:
> Looking at the two logs, looks like corosync decided that xst1 was offline, 
> while xst was still online.
> I just issued an "ifconfig ha0 down" on xst1, so I expect both nodes cannot 
> see the other one, while I see these same lines in both the xst1 and xst2 logs:
>  
> Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      Cluster node 
> xstha1 will be fenced: peer is no longer part of the cluster

You should pay more attention to what you write. While occasional typos
of course happen, you consistently use wrong names for nodes which do
not match actual logs. If you continue this way nobody will be able to
follow.

> 
> why xst2 and not xst1?

Logs mention neither xst1 nor xst2.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Andrei Borzenkov
16.12.2020 17:56, Gabriele Bulfon пишет:
> Thanks, here are the logs, there are infos about how it tried to start 
> resources on the nodes.

Both logs are from the same node.

> Keep in mind the node1 was already running the resources, and I simulated a 
> problem by turning down the ha interface.
>  

There is no attempt to start resources in these logs. The logs end with
the stonith request. As this node had a 10s delay, it probably was
successfully eliminated by the other node, but there are no logs from
that node.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop completion

2020-12-16 Thread Xin Liang
Hi Ulrich,

In https://github.com/ClusterLabs/crmsh/pull/273/files, the 4.2.0 definitely 
already included this change.

In my clean environment(0 resources configured cluster):

  1.  crm configure primitive d Dummy   (then resource d will be started)
  2.  crm resource stop (or  after stop, then you 
will find d completed)
  3.  crm resource start (or  after start, then you 
got d completed)


From: Users  on behalf of Ulrich Windl 

Sent: Wednesday, December 16, 2020 5:57 PM
To: users@clusterlabs.org 
Subject: [ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop completion

>>> Xin Liang  schrieb am 15.12.2020 um 03:55 in Nachricht


> Hi Ulrich,
>
> crmsh already could do this completion:)

Hi!

Since when? Mine (crmsh-4.2.0+git.1604052559.2a348644-5.26.1.noarch) seems
unable: When I use "crm(live/h16)resource# start " and press TAB, I see
resources that are running already. Maybe that's because they don't have a target-role
(so they are started by default).

Can you confirm?

Regards,
Ulrich

>
> Regards,
> Xin
> 
> From: Users  on behalf of Ulrich Windl
> 
> Sent: Monday, December 14, 2020 6:51 PM
> To: users@clusterlabs.org 
> Subject: [ClusterLabs] Q: crm resource start/stop completion
>
> Hi!
>
> I wonder: Would it be difficult and would it make sense to change crm shell

> to:
> Complete "resource start " with only those resources that aren't running
> (per role) already
> Complete "resource stopt " with only those resources that are running (per
> role)
>
> I came to this after seeing "3 disabled resources" in crm_mon output. As
> crm_mon itself cannot display just those disabled resources (yet? ;‑)), I
used
> crm shell...
>
> Regards,
> Ulrich
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] crm shell: "params param"?

2020-12-16 Thread Xin Liang
Hi Ulrich,

> params param config="/etc/libvirt/libxl/test-jeos.xml

It seems it should raise an error while parsing input from the command line, since 
"param" does not exist as an RA parameter

Thanks for the report!

Regards,
Xin

From: Users  on behalf of Ulrich Windl 

Sent: Wednesday, December 16, 2020 10:23 PM
To: users@clusterlabs.org ; Roger Zhou 
Subject: [ClusterLabs] crm shell: "params param"?

>>> Ulrich Windl schrieb am 16.12.2020 um 14:53 in Nachricht <5FDA1167.92C : 
>>> 161 :
60728>:

[...]
> primitive prm_xen_test-jeos VirtualDomain \
> params param config="/etc/libvirt/libxl/test-jeos.xml"

BTW: Is "params param" a bug in crm shell? In older crm shell I only see 
"params"...

> hypervisor="xen:///system" autoset_utilization_cpu=false
> autoset_utilization_hv_memory=false \
> op start timeout=120 interval=0 \
> op stop timeout=180 interval=0 \
> op monitor interval=600 timeout=90 \
> op migrate_to timeout=300 interval=0 \
> op migrate_from timeout=300 interval=0 \
> utilization utl_cpu=20 utl_ram=2048 \
> meta priority=123 allow-migrate=true
>
[...]

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Looks like my installation doesn't like this attribute:
 
sonicle@xstorage1:/sonicle/var/log/cluster# crm configure property 
stonith-enabled=true
ERROR: Warnings found during check: config may not be valid
Do you still want to commit (y/n)? n
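A hedged way to see which warnings the check is complaining about (both
commands should exist in this setup, though I have not verified the output on
OpenIndiana):
 
  crm configure verify
  crm_verify -L -V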
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 16 dicembre 2020 15.45.36 CET
Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 in
Nachricht <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> a few seconds where both the NFS IP and the ZFS pool (!) are mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Looking at the two logs, looks like corosync decided that xst1 was offline, 
while xst was still online.
I just issued an "ifconfig ha0 down" on xst1, so I expect both nodes cannot see 
the other one, while I see these same lines in both the xst1 and xst2 logs:
 
Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      Cluster node 
xstha1 will be fenced: peer is no longer part of the cluster
Dec 16 15:08:56 [667]    pengine:  warning: determine_online_status:    Node 
xstha1 is unclean
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status_fencing:    
Node xstha2 is active
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status:    Node 
xstha2 is online
 
why xst2 and not xst1?
I would expect no action at all in this case, until stonith is done...
While it goes on with :
 
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
zpool_data_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
 
trying to stop everything on xst1 (but it's not runnable).
Then:
 
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
xstha1_san0_IP     ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha2_san0_IP  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
zpool_data         ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha1-stonith  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Stop       
xstha2-stonith     (           xstha1 )   due to node availability
 
as if xst2 has been elected to be the running node, not knowing that xst1 will kill 
xst2 within a few seconds.
 
What is wrong here?
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

 


Da: Gabriele Bulfon 
A: Cluster Labs - All topics related to open-source clustering welcomed 

Data: 16 dicembre 2020 15.56.28 CET
Oggetto: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource



 
Thanks, here are the logs, there are infos about how it tried to start 
resources on the nodes.
Keep in mind the node1 was already running the resources, and I simulated a 
problem by turning down the ha interface.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 16 dicembre 2020 15.45.36 CET
Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 in
Nachricht <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> dalay to power off node 1) it actually starts resources, causing a moment of 
> few seconds where both NFS IP and ZFS pool (!) is mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will tra to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Ken Gaillot
On Wed, 2020-12-16 at 10:06 +0100, Ulrich Windl wrote:
> Hi!
> 
> (I changed the subject of the thread)
> VirtualDomain seems to be broken, as it does not handle a failed
> live-migration correctly:
> 
> With my test-VM running on node h16, this happened when I tried to
> move it away (for testing):
> 
> Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  *
> Migrateprm_xen_test-jeos( h16 -> h19 )
> Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating
> migrate_to operation prm_xen_test-jeos_migrate_to_0 on h16
> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840
> aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16:
> Event failed
> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840
> action 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but
> got 'error'
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected
> result (error: test-jeos: live migration to h19 failed: 1) was
> recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16
> 09:28:46 2020
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected
> result (error: test-jeos: live migration to h19 failed: 1) was
> recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16
> 09:28:46 2020
> ### (note the message above is a duplicate!)

A bit confusing, but that's because the operation is recorded twice,
once on its own, and once as "last_failure". If the operation later
succeeds, the "on its own" entry will be overwritten by the success,
but the "last_failure" will stick around until the resource is cleaned
up. That's how failures can continue to be shown in status after a
later success.
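For example, clearing that recorded failure (so it no longer shows up in
status) could be done with something like the following; the resource and node
names are taken from the logs above:

  crm_resource --cleanup --resource prm_xen_test-jeos --node h16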

> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource
> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
> ### This is nonsense after a failed live migration!

That's just wording; it should probably say "could be active". From
Pacemaker's point of view, since it doesn't have any service-specific
intelligence, a failed migration might have left the resource active on
one node, the other, or both.

> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  *
> Recoverprm_xen_test-jeos( h19 )
> 
> 
> So the cluster is exactly doing the wrong thing: The VM is still
> active on h16, while a "recovery" on h19 will start it there! So
> _after_ the recovery the VM is duplicated.

The problem here is that a stop should be scheduled on both nodes, not
just one of them. Then the start is scheduled on only one node.

Do you have the pe input from this transition?

> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating
> stop operation prm_xen_test-jeos_stop_0 locally on h19
> Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO:
> Domain test-jeos already stopped.
> Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos 
> stop (call 372, PID 20620) exited with status 0 (execution time
> 283ms, queue time 0ms)
> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Result of stop
> operation for prm_xen_test-jeos on h19: ok
> Dec 16 09:31:45 h19 pacemaker-controld[4428]:  notice: Initiating
> start operation prm_xen_test-jeos_start_0 locally on h19
> 
> Dec 16 09:31:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos 
> start (call 373, PID 21005) exited with status 0 (execution time
> 2715ms, queue time 0ms)
> Dec 16 09:31:47 h19 pacemaker-controld[4428]:  notice: Result of
> start operation for prm_xen_test-jeos on h19: ok
> Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]:  warning: Unexpected
> result (error: test-jeos: live migration to h19 failed: 1) was
> recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16
> 09:28:46 2020
> 
> Amazingly manual migration using virsh worked:
> virsh migrate --live test-jeos xen+tls://h18...
> 
> Regards,
> Ulrich Windl
> 
> 
> > > > Ulrich Windl schrieb am 14.12.2020 um 15:21 in Nachricht
> > > > <5FD774CF.8DE : 161 :
> 
> 60728>:
> > Hi!
> > 
> > I think I found the problem why a VM is started on two nodes:
> > 
> > Live-Migration had failed (e.g. away from h16), so the cluster uses
> > stop and 
> > start (stop on h16, start on h19 for example).
> > When rebooting h16, I see these messages (h19 is DC):
> > 
> > Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  warning:
> > Unexpected result 
> > (error: test-jeos: live migration to h16 failed: 1) was recorded
> > for 
> > migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
> > Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Resource 
> > prm_xen_test-jeos is active on 2 nodes (attempting recovery)
> > 
> > Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  notice:  *
> > Restart
> > prm_xen_test-jeos( h16 )
> > 
> > THIS IS WRONG: h16 was booted, so no VM is running on h16 (unless
> > there was 
> > some autostart from libvirt. " virsh list --autostart" does not
> > list any)
> 

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Ken Gaillot
On Wed, 2020-12-16 at 15:56 +0100, Gabriele Bulfon wrote:
> Thanks, here are the logs, there are infos about how it tried to
> start resources on the nodes.
> Keep in mind the node1 was already running the resources, and I
> simulated a problem by turning down the ha interface.
>  
> Gabriele

From the logs, Pacemaker is scheduling resource recovery after fencing
(which means stonith-enabled must already be true, by the way). I don't
know how you could see resources start without fencing succeeding
first.

Have you tested the fence devices themselves? E.g. manually run the
fence agent with the same parameters, or run "stonith_admin --reboot
". It's possible the fence device is returning success without
actually doing the fencing, though I'm not sure how that would happen
either.
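A minimal sketch of such a manual test, using the node names that appear in
the attached logs:

  # ask the fencer to reboot the peer, then check whether it really went down
  stonith_admin --reboot xstha1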

BTW if you're using corosync < 3, turning down the interface isn't a
good test. Physically pulling the cable, or using the firewall to block
both incoming and outgoing packets on the interface, is better.
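For instance, on a Linux node the firewall variant could look like this (a
hedged sketch; the interface name is taken from your "ifconfig ha0 down" test,
and other platforms would use their native firewall instead of iptables):

  iptables -A INPUT  -i ha0 -j DROP
  iptables -A OUTPUT -o ha0 -j DROP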

>  
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>  
> 
> 
> 
> ---
> ---
> 
> Da: Ulrich Windl 
> A: users@clusterlabs.org 
> Data: 16 dicembre 2020 15.45.36 CET
> Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource
> 
> > >>> Gabriele Bulfon  schrieb am 16.12.2020 um
> > 15:32 in
> > Nachricht <1523391015.734.1608129155836@www>:
> > > Hi, I have now a two node cluster using stonith with different 
> > > pcmk_delay_base, so that node 1 has priority to stonith node 2 in
> > case of 
> > > problems.
> > > 
> > > Though, there is still one problem: once node 2 delays its
> > stonith action 
> > > for 10 seconds, and node 1 just 1, node 2 does not delay start of
> > resources, 
> > > so it happens that while it's not yet powered off by node 1 (and
> > waiting its 
> > > delay to power off node 1) it actually starts resources, causing
> > a moment of 
> > > few seconds where both NFS IP and ZFS pool (!) are mounted by
> > both!
> > 
> > AFAIK pacemaker will not start resources on a node that is
> > scheduled for stonith. Even more: Pacemaker will try to stop
> > resources on a node scheduled for stonith to start them elsewhere.
> > 
> > > How can I delay node 2 resource start until the delayed stonith
> > action is 
> > > done? Or how can I just delay the resource start so I can make it
> > larger than 
> > > its pcmk_delay_base?
> > 
> > We probably need to see logs and configs to understand.
> > 
> > > 
> > > Also, I was suggested to set "stonith-enabled=true", but I don't
> > know where 
> > > to set this flag (cib-bootstrap-options is not happy with it...).
> > 
> > I think it's on by default, so you must have set it to false.
> > In crm shell it is "configure# property stonith-enabled=...".
> > 
> > Regards,
> > Ulrich
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-16 Thread Ken Gaillot
On Wed, 2020-12-16 at 15:16 +0100, Gabriele Bulfon wrote:
> Ok, I used some OpenIndiana patches and now it works, and also
> accepts the pcmk_delay_base param.

Good to know. Beyond ensuring Pacemaker builds, we don't target or test
BSDish distros much, so they're a bit more likely to see issues. We're
happy to take pull requests with fixes though. What patches did you
have to apply?

You may have run into something like this:

  https://bugs.clusterlabs.org/show_bug.cgi?id=5397 

>  
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets

^^^ wow, love it
 
>  
> 
> 
> Da: Gabriele Bulfon 
> A: Cluster Labs - All topics related to open-source clustering
> welcomed 
> Data: 16 dicembre 2020 9.27.31 CET
> Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node
> failure
> 
> 
> >  
> > Thanks! I updated package to version 1.1.24, but now I receive an
> > error on pacemaker.log :
> >  
> > Dec 16 09:08:23 [5090] pacemakerd:error: mcp_read_config:  
> > Could not verify authenticity of CMAP provider: Operation not
> > supported (48)
> >  
> > Any idea what's wrong? This was not happening on version 1.1.17.
> > What I noticed is that 1.1.24 mcp/corosync.c has a new section at
> > line 334:
> >  
> > #if HAVE_CMAP
> > rc = cmap_fd_get(local_handle, &fd);
> > if (rc != CS_OK) {
> > .
> > and this is the failing part at line 343:
> >  
> > if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0, (gid_t) 0, &found_pid,
> >                                         &found_uid, &found_gid))) {
> >  
> > Gabriele
> >  
> >  
> > Sonicle S.r.l. : http://www.sonicle.com
> > Music: http://www.gabrielebulfon.com
> > eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
> >  
> > 
> > 
> > 
> > -
> > -
> > 
> > Da: Andrei Borzenkov 
> > A: Cluster Labs - All topics related to open-source clustering
> > welcomed  
> > Data: 15 dicembre 2020 10.52.46 CET
> > Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from
> > node failure
> > 
> > > pcmk_delay_base was introduced in 1.1.17 and you apparently have
> > > 1.1.15 (unless it was backported by your distribution). Sorry.
> > > 
> > > pcmk_delay_max may work, I cannot find in changelog when it
> > > appeared.
> > > 
> > > 
> > > On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon <
> > > gbul...@sonicle.com> wrote:
> > > >
> > > > Here it is, thanks!
> > > >
> > > > Gabriele
> > > >
> > > >
> > > > Sonicle S.r.l. : http://www.sonicle.com
> > > > Music: http://www.gabrielebulfon.com
> > > > eXoplanets : 
> > > https://gabrielebulfon.bandcamp.com/album/exoplanets
> > > >
> > > >
> > > >
> > > >
> > > > -
> > > -
> > > >
> > > > Da: Andrei Borzenkov 
> > > > A: Cluster Labs - All topics related to open-source clustering
> > > welcomed 
> > > > Data: 14 dicembre 2020 15.56.32 CET
> > > > Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from
> > > node failure
> > > >
> > > > On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon <
> > > gbul...@sonicle.com> wrote:
> > > > >
> > > > > I isolated the log when everything happens (when I disable
> > > the ha interface), attached here.
> > > > >
> > > >
> > > > And where are matching logs from the second node?
> > > > ___
> > > > Manage your subscription:
> > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > ClusterLabs home: https://www.clusterlabs.org/
> > > >
> > > >
> > > > ___
> > > > Manage your subscription:
> > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > ClusterLabs home: https://www.clusterlabs.org/
> > > ___
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > 
> > > ClusterLabs home: https://www.clusterlabs.org/
> > > 
> > > 
> > 
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Thanks, here are the logs, there are infos about how it tried to start 
resources on the nodes.
Keep in mind the node1 was already running the resources, and I simulated a 
problem by turning down the ha interface.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 16 dicembre 2020 15.45.36 CET
Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 in
Nachricht <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> a few seconds where both the NFS IP and the ZFS pool (!) are mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Dec 16 15:08:54 [642] xstorage2 corosync notice  [TOTEM ] A processor failed, 
forming new configuration.
Dec 16 15:08:56 [642] xstorage2 corosync notice  [TOTEM ] A new membership 
(10.100.100.2:408) was formed. Members left: 1
Dec 16 15:08:56 [642] xstorage2 corosync notice  [TOTEM ] Failed to receive the 
leave message. failed: 1
Dec 16 15:08:56 [666]  attrd: info: pcmk_cpg_membership:Group 
attrd event 2: xstha1 (node 1 pid 710) left via cluster exit
Dec 16 15:08:56 [663]cib: info: pcmk_cpg_membership:Group 
cib event 2: xstha1 (node 1 pid 707) left via cluster exit
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_cpg_membership:Group 
pacemakerd event 2: xstha1 (node 1 pid 687) left via cluster exit
Dec 16 15:08:56 [642] xstorage2 corosync notice  [QUORUM] Members[1]: 2
Dec 16 15:08:56 [662] pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [666]  attrd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_cpg_membership:Group 
pacemakerd event 2: xstha2 (node 2 pid 662) is member
Dec 16 15:08:56 [642] xstorage2 corosync notice  [MAIN  ] Completed service 
synchronization, ready to provide service.
Dec 16 15:08:56 [668]   crmd: info: pcmk_cpg_membership:Group 
crmd event 2: xstha1 (node 1 pid 712) left via cluster exit
Dec 16 15:08:56 [664] stonith-ng: info: pcmk_cpg_membership:Group 
stonith-ng event 2: xstha1 (node 1 pid 708) left via cluster exit
Dec 16 15:08:56 [663]cib: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [668]   crmd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [666]  attrd:   notice: attrd_remove_voter: Lost attribute 
writer xstha1
Dec 16 15:08:56 [664] stonith-ng: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_quorum_notification:   Quorum 
retained | membership=408 members=1
Dec 16 15:08:56 [663]cib:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 16 15:08:56 [664] stonith-ng:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 16 15:08:56 [662] pacemakerd:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_reap_unseen_nodes
Dec 16 15:08:56 [668]   crmd: info: peer_update_callback:   Client 
xstha1/peer 

Re: [ClusterLabs] Best way to create a floating identity file

2020-12-16 Thread Ken Gaillot
On Wed, 2020-12-16 at 04:46 -0500, Tony Stocker wrote:
> On Tue, Dec 15, 2020 at 12:29 PM Ken Gaillot 
> wrote:
> > 
> > On Tue, 2020-12-15 at 17:02 +0300, Andrei Borzenkov wrote:
> > > On Tue, Dec 15, 2020 at 4:58 PM Tony Stocker <
> > > akostoc...@gmail.com>
> > > wrote:
> > > > 
> > 
> > Just for fun, some other possibilities:
> > 
> > You could write your script/cron as an OCF RA itself, with an
> > OCF_CHECK_LEVEL=20 monitor doing the actual work, scheduled to run
> > at
> > whatever interval you want (or using time-based rules, enabling it
> > to
> > run at a particular time). Then you can colocate it with the
> > workload
> > resources.
> > 
> > Or you could write a systemd timer unit to call your script when
> > desired, and colocate that with the workload as a systemd resource
> > in
> > the cluster.
> > 
> > Or similar to the crm_resource method, you could colocate an
> > ocf:pacemaker:attribute resource with the workload, and have your
> > script check the value of the node attribute (with attrd_updater
> > -Q) to
> > know whether to do stuff or not.
> > --
> 
> All three options look interesting, but the last one seems the
> simplest. Looking at the description I'm curious to know what happens
> with the 'inactive_value' string. Is that put in the 'state' file
> location whenever a node is not the active one? For example, when I
> first set up the attribute and it gets put on the active node
> currently running the resource group with the 'active_value' string,
> will the current backup node automatically get the same 'state' file
> created with the 'inactive_value'? Or does that only happen when the
> resource group is moved?
> 
> Secondly, does this actually create a file with a plaintext entry
> matching one of the *_value strings? Or is it simply an empty file
> with the information stored somewhere in the depths of the PM config?

The state file is just an empty file used to determine whether the
resource is "running" or not (since there's no actual daemon process
kept around for it).
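For completeness, a sketch (in crm shell syntax) of the ocf:pacemaker:attribute
approach from my earlier reply; the resource and group names here are
placeholders:

  crm configure primitive proj-active ocf:pacemaker:attribute \
          params name=proj-active active_value=1 inactive_value=0
  crm configure colocation proj-active-with-workload inf: proj-active workload-group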

> Finally (for the moment), what does the output of 'attrd_updater -Q'
> look like? I need to figure out how to utilize the output for a cron
> 'if' statement similar to the previous one:
> 
> if [ -f /var/local/project/cluster-node ] && [ `cat
> /var/local/project/cluster-node` = "distroserver" ]; then ...

First you need to know the node attribute name. By default this is
"opa-" plus the resource ID but you can configure it as a resource
parameter (name="whatever") if you want something more obvious.

Then you can query the value on the local node with:

attrd_updater -Q -n 

It's possible the attribute has not been set at all (the node has never
run the resource). In that case there will be an error return and a
message on stderr.

If the attribute has been set, the output will look like

name="attrname" host="nodename" value="1"

Looking at it now, I realize there should be a --quiet option to print
just the value by itself, but that doesn't exist currently. :) Also, we
are moving toward having the option of XML output for all tools, which
is more reliable for parsing by scripts than textual output that can at
least theoretically change from release to release, but attrd_updater
hasn't gained that capability yet.

That means a (somewhat uglier) one-liner test would be something like:

  [ "$(attrd_updater -Q -n attrname 2>/dev/null | sed -n -e 's/.* 
value="\(.*\)".*/\1/p')" = "1" ]

That relies on the fact that the value will be "1" (or whatever you set
as active_value) only if the attribute resource is currently active on
the local node. Otherwise it will be "0" (if the resource previously
ran on the local node but no longer is) or empty (if the resource never
ran on the local node).
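Putting that together, a cron wrapper might look something like this (a
sketch; the attribute name and the command to run are placeholders):

  #!/bin/sh
  # run the periodic work only where the attribute resource is currently active
  val=$(attrd_updater -Q -n opa-myattr 2>/dev/null | sed -n -e 's/.* value="\(.*\)".*/\1/p')
  if [ "$val" = "1" ]; then
      /usr/local/bin/do-project-work
  fi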

> since the cron script is run on both nodes, I need to know how the
> output can be used to determine which node will run the necessary
> commands. If the return values are the same regardless of which node
> I
> run attrd_updater on, what do I use to differentiate?
> 
> Unfortunately right now I don't have a test cluster that I can play
> with things on, only a 'live' one that we had to rush into service
> with a bare minimum of testing, so I'm loath to play with things on
> it.
> 
> Thanks!
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Hi, I have now a two node cluster using stonith with different pcmk_delay_base, 
so that node 1 has priority to stonith node 2 in case of problems.
 
Though, there is still one problem: once node 2 delays its stonith action for 
10 seconds, and node 1 just 1, node 2 does not delay start of resources, so it 
happens that while it's not yet powered off by node 1 (and waiting its delay to 
power off node 1) it actually starts resources, causing a moment of a few seconds 
where both the NFS IP and the ZFS pool (!) are mounted by both!
How can I delay node 2 resource start until the delayed stonith action is done? 
Or how can I just delay the resource start so I can make it larger than its 
pcmk_delay_base?
 
Also, I was suggested to set "stonith-enabled=true", but I don't know where to 
set this flag (cib-bootstrap-options is not happy with it...).
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-16 Thread Gabriele Bulfon
Ok, I used some OpenIndiana patches and now it works, and also accepts the 
pcmk_delay_base param.
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

 


Da: Gabriele Bulfon 
A: Cluster Labs - All topics related to open-source clustering welcomed 

Data: 16 dicembre 2020 9.27.31 CET
Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure



 
Thanks! I updated package to version 1.1.24, but now I receive an error on 
pacemaker.log :
 
Dec 16 09:08:23 [5090] pacemakerd:    error: mcp_read_config:   Could not 
verify authenticity of CMAP provider: Operation not supported (48)
 
Any idea what's wrong? This was not happening on version 1.1.17.
What I noticed is that 1.1.24 mcp/corosync.c has a new section at line 334:
 
#if HAVE_CMAP
    rc = cmap_fd_get(local_handle, &fd);
    if (rc != CS_OK) {
.
and this is the failing part at line 343:
 
    if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0, (gid_t) 0, &found_pid,
                                            &found_uid, &found_gid))) {
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Andrei Borzenkov 
A: Cluster Labs - All topics related to open-source clustering welcomed 
 
Data: 15 dicembre 2020 10.52.46 CET
Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


pcmk_delay_base was introduced in 1.1.17 and you apparently have
1.1.15 (unless it was backported by your distribution). Sorry.

pcmk_delay_max may work, I cannot find in changelog when it appeared.


On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon  wrote:
>
> Here it is, thanks!
>
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
> --
>
> Da: Andrei Borzenkov 
> A: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Data: 14 dicembre 2020 15.56.32 CET
> Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure
>
> On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon  wrote:
> >
> > I isolated the log when everything happens (when I disable the ha 
> > interface), attached here.
> >
>
> And where are matching logs from the second node?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Roger Zhou



On 12/16/20 5:06 PM, Ulrich Windl wrote:

Hi!

(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live-migration 
correctly:

With my test-VM running on node h16, this happened when I tried to move it away 
(for testing):

Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  * Migrate
prm_xen_test-jeos( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating migrate_to 
operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 aborted 
by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed


The RA's migrate_to failed quickly. Maybe the configuration is not perfect enough?

How about enabling trace, and collecting more RA logs to check exactly which virsh 
command is used, and checking whether it works manually:


`crm resource trace prm_xen_test-jeos`



Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 action 
115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note the message above is a duplicate!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource 
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!


Indeed, sounds like a valid improvement for pacemaker-schedulerd? Or, 
articulate what to do when migrate_to fails. I couldn't find the 
definition in any doc yet.



Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  * Recover
prm_xen_test-jeos( h19 )


So the cluster is exactly doing the wrong thing: The VM is still active on h16, while a 
"recovery" on h19 will start it there! So _after_ the recovery the VM is 
duplicated.

Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating stop 
operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain 
test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos stop 
(call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 
0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Result of stop operation 
for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]:  notice: Initiating start 
operation prm_xen_test-jeos_start_0 locally on h19

Dec 16 09:31:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos start 
(call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 
0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]:  notice: Result of start 
operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020



yeah, schedulerd is trying so hard to report the migrate_to failure here!


Amazingly manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...



What about s/h18/h19/?

Or, manually reproduce exactly as the RA code:

`virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri 
$migrateuri`



Good luck!
Roger



Regards,
Ulrich Windl



Ulrich Windl schrieb am 14.12.2020 um 15:21 in Nachricht <5FD774CF.8DE : 161 :

60728>:

Hi!

I think I found the problem why a VM is started on two nodes:

Live-Migration had failed (e.g. away from h16), so the cluster uses stop and
start (stop on h16, start on h19 for example).
When rebooting h16, I see these messages (h19 is DC):

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result
(error: test-jeos: live migration to h16 failed: 1) was recorded for
migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  notice:  * Restart
prm_xen_test-jeos( h16 )

THIS IS WRONG: h16 was booted, so no VM is running on h16 (unless there was
some autostart from libvirt. " virsh list --autostart" does not list any)

Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
test-jeos already stopped.

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Calculated
transition 669 (with errors), saving inputs in
/var/lib/pacemaker/pengine/pe-error-4.bz2

What's going on here?

Regards,
Ulrich


Ulrich Windl schrieb am 14.12.2020 um 

[ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Ulrich Windl
>>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 in
Nachricht <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
>  
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> a few seconds where both the NFS IP and the ZFS pool (!) are mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

>  
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] crm shell: "params param"?

2020-12-16 Thread Ulrich Windl
>>> Ulrich Windl schrieb am 16.12.2020 um 14:53 in Nachricht <5FDA1167.92C : 
>>> 161 :
60728>:

[...]
> primitive prm_xen_test-jeos VirtualDomain \
> params param config="/etc/libvirt/libxl/test-jeos.xml" 

BTW: Is "params param" a bug in crm shell? In older crm shell I only see 
"params"...

> hypervisor="xen:///system" autoset_utilization_cpu=false 
> autoset_utilization_hv_memory=false \
> op start timeout=120 interval=0 \
> op stop timeout=180 interval=0 \
> op monitor interval=600 timeout=90 \
> op migrate_to timeout=300 interval=0 \
> op migrate_from timeout=300 interval=0 \
> utilization utl_cpu=20 utl_ram=2048 \
> meta priority=123 allow-migrate=true
> 
[...]

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Ulrich Windl
>>> Roger Zhou  schrieb am 16.12.2020 um 13:58 in Nachricht
<8ab80ef4-462c-421b-09b8-084d270d4...@suse.com>:

> On 12/16/20 5:06 PM, Ulrich Windl wrote:
>> Hi!
>> 
>> (I changed the subject of the thread)
>> VirtualDomain seems to be broken, as it does not handle a failed 
> live-migration correctly:
>> 
>> With my test-VM running on node h16, this happened when I tried to move it 
> away (for testing):
>> 
>> Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  * Migrate
> prm_xen_test-jeos( h16 -> h19 )
>> Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating migrate_to 
> operation prm_xen_test-jeos_migrate_to_0 on h16
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 
> aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event 
> failed
> 
> The RA's migrate_to failed quickly. Maybe the configuration is not perfect 
> enough?

Probably you are right, but shouldn't the reason for the failure play a part 
in the decision on how to handle it? Specifically the case when the node was just 
booted, and the cluster claimed a VM would be running there.
There is no problem with one VM running on two nodes UNTIL pacemaker decides to 
"recover" from it. THEN one VM is running on two nodes.

For reference, I had configured "xen+tls" and I can live-migrate the VMs 
manually using "virsh". The RA config basically is:

primitive prm_xen_test-jeos VirtualDomain \
params param config="/etc/libvirt/libxl/test-jeos.xml" 
hypervisor="xen:///system" autoset_utilization_cpu=false 
autoset_utilization_hv_memory=false \
op start timeout=120 interval=0 \
op stop timeout=180 interval=0 \
op monitor interval=600 timeout=90 \
op migrate_to timeout=300 interval=0 \
op migrate_from timeout=300 interval=0 \
utilization utl_cpu=20 utl_ram=2048 \
meta priority=123 allow-migrate=true

> 
> How about enabling tracing and collecting more RA logs, to check which exact 
> virsh command is used and whether it works when run manually?
> 
> `crm resource trace prm_xen_test-jeos`

BTW:
h16:~ # crm resource trace prm_xen_test-jeos
INFO: Trace for prm_xen_test-jeos is written to /var/lib/heartbeat/trace_ra/
INFO: Trace set, restart prm_xen_test-jeos to trace non-monitor operations
h16:~ # ll /var/lib/heartbeat/trace_ra/
ls: cannot access '/var/lib/heartbeat/trace_ra/': No such file or directory
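
(Just a guess on my side: the trace files are only written when a traced
operation actually runs, so the directory should appear after the next
non-monitor operation, e.g. something like

h16:~ # crm resource trace prm_xen_test-jeos migrate_to
h16:~ # crm resource move prm_xen_test-jeos h18
h16:~ # ls -l /var/lib/heartbeat/trace_ra/VirtualDomain/

where the VirtualDomain subdirectory and the per-operation file names are an
assumption, not something I have verified here.)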


h16:~ # virsh list
 Id   NameState
---
 0Domain-0running
 4test-jeos   running
h16:~ # crm resource move prm_xen_test-jeos PT5M force
Migration will take effect until: 2020-12-16 14:17:28 +01:00
INFO: Move constraint created for prm_xen_test-jeos

I could not find the trace information, but I found this in syslog:
Dec 16 14:17:13 h19 VirtualDomain(prm_xen_test-jeos)[20077]: INFO: test-jeos: 
Starting live migration to h18 (using: virsh --connect=xen:///system --quiet 
migrate --live  test-jeos xen://h18/system ).

I guess "xen://h18/system" should be "xen+tls://h18/system" here. In fact I 
verified it interactively. Due to the certificate subject the FQHN has to be 
used, too; otherwise I see "warning : virNetTLSContextCheckCertificate:1082 : 
Certificate check failed Certificate [session] owner does not match the 
hostname h18".

Regards,
Ulrich

> 
> 
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 action 
> 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h19 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h19 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>> ### (note the message above is a duplicate!)
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource 
> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>> ### This is nonsense after a failed live migration!
> 
> Indeed, sounds like a valid improvement for pacemaker-schedulerd? Or, 
> articulate what should happen when migrate_to fails. I couldn't find that 
> defined in any doc yet.
> 
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  * Recover
> prm_xen_test-jeos( h19 )
>> 
>> 
>> So the cluster is doing exactly the wrong thing: The VM is still active on 
> h16, while a "recovery" on h19 will start it there! So _after_ the recovery 
> the VM is duplicated.
>> 
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating stop 
> operation prm_xen_test-jeos_stop_0 locally on h19
>> Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain 
> test-jeos already stopped.
>> Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos stop 
> (call 372, 

Re: [ClusterLabs] crm enhancement proposal (configure grep): Opinions?

2020-12-16 Thread Roger Zhou

Hi Ulrich,

Sounds reasonable and handy! Can you create a GitHub issue to track this?

Thanks,
Roger


On 11/30/20 8:47 PM, Ulrich Windl wrote:

Hi!

What would users of crm shell think about this enhancement proposal:
crm configure grep <pattern>
That command would search the configuration for any occurrence of <pattern> and 
would list the names of the objects where it occurred.

That is, if the pattern is testXYZ, then all resources that either have testXYZ in 
their name or contain the string testXYZ anywhere "inside" their definition would be listed.

One could even construct more interesting commands like
"show [all] matching <pattern>" or "edit [all] matching <pattern>".

Regards,
Ulrich
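
As a rough interim workaround (just a sketch, not the proposed feature itself)
one can already pipe the textual configuration through grep today:

crm configure show | grep -n testXYZ

but that only prints the matching lines, not the names of the objects that
contain them, which is what "configure grep" would add.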




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-16 Thread Gabriele Bulfon
Thanks! I updated the package to version 1.1.24, but now I receive an error in 
pacemaker.log:
 
Dec 16 09:08:23 [5090] pacemakerd:    error: mcp_read_config:   Could not 
verify authenticity of CMAP provider: Operation not supported (48)
 
Any idea what's wrong? This was not happening on version 1.1.17.
What I noticed is that 1.1.24 mcp/corosync.c has a new section at line 334:
 
#if HAVE_CMAP
    rc = cmap_fd_get(local_handle, &fd);
    if (rc != CS_OK) {
.
and this is the failing part at line 343:
 
    if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0,(gid_t) 0, &found_pid,
                                            &found_uid, &found_gid))) {
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: Cluster Labs - All topics related to open-source clustering welcomed 
 
Date: 15 December 2020 10.52.46 CET
Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


pcmk_delay_base was introduced in 1.1.17 and you apparently have
1.1.15 (unless it was backported by your distribution). Sorry.

pcmk_delay_max may work; I cannot find in the changelog when it appeared.


On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon  wrote:
>
> Here it is, thanks!
>
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
> --
>
> From: Andrei Borzenkov 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Date: 14 December 2020 15.56.32 CET
> Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure
>
> On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon  wrote:
> >
> > I isolated the log when everything happens (when I disable the ha 
> > interface), attached here.
> >
>
> And where are matching logs from the second node?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Q: crm resource start/stop completion

2020-12-16 Thread Ulrich Windl
>>> Xin Liang  schrieb am 15.12.2020 um 03:55 in Nachricht


> Hi Ulrich,
> 
> crmsh can already do this completion :)

Hi!

Since when? Mine (crmsh-4.2.0+git.1604052559.2a348644-5.26.1.noarch) seems
unable: When I use "crm(live/h16)resource# start " and press TAB, I see
resources that are running already. Maybe that is because they don't have a 
target-role (so they are started by default).

Can you confirm?

Regards,
Ulrich

> 
> Regards,
> Xin
> 
> From: Users  on behalf of Ulrich Windl 
> 
> Sent: Monday, December 14, 2020 6:51 PM
> To: users@clusterlabs.org 
> Subject: [ClusterLabs] Q: crm resource start/stop completion
> 
> Hi!
> 
> I wonder: Would it be difficult and would it make sense to change crm shell

> to:
> Complete "resource start " with only those resources that aren't running 
> (per role) already
> Complete "resource stopt " with only those resources that are running (per 
> role)
> 
> I came to this after seeing "3 disabled resources" in crm_mon output. As 
> crm_mon itself cannot display just those disabled resources (yet? ;‑)), I
used 
> crm shell...
> 
> Regards,
> Ulrich
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Best way to create a floating identity file

2020-12-16 Thread Tony Stocker
On Tue, Dec 15, 2020 at 12:29 PM Ken Gaillot  wrote:
>
> On Tue, 2020-12-15 at 17:02 +0300, Andrei Borzenkov wrote:
> > On Tue, Dec 15, 2020 at 4:58 PM Tony Stocker 
> > wrote:
> > >
> Just for fun, some other possibilities:
>
> You could write your script/cron as an OCF RA itself, with an
> OCF_CHECK_LEVEL=20 monitor doing the actual work, scheduled to run at
> whatever interval you want (or using time-based rules, enabling it to
> run at a particular time). Then you can colocate it with the workload
> resources.
>
> Or you could write a systemd timer unit to call your script when
> desired, and colocate that with the workload as a systemd resource in
> the cluster.
>
> Or similar to the crm_resource method, you could colocate an
> ocf:pacemaker:attribute resource with the workload, and have your
> script check the value of the node attribute (with attrd_updater -Q) to
> know whether to do stuff or not.
> --

All three options look interesting, but the last one seems the
simplest. Looking at the description I'm curious to know what happens
with the 'inactive_value' string. Is that put in the 'state' file
location whenever a node is not the active one? For example, when I
first set up the attribute and it gets put on the active node
currently running the resource group with the 'active_value' string,
will the current backup node automatically get the same 'state' file
created with the 'inactive_value'? Or does that only happen when the
resource group is moved?

Secondly, does this actually create a file with a plaintext entry
matching one of the *_value strings? Or is it simply an empty file
with the information stored somewhere in the depths of the PM config?

Finally (for the moment), what does the output of 'attrd_updater -Q'
look like? I need to figure out how to utilize the output for a cron
'if' statement similar to the previous one:

if [ -f /var/local/project/cluster-node ] && \
   [ `cat /var/local/project/cluster-node` = "distroserver" ]; then ...

since the cron script is run on both nodes, I need to know how the
output can be used to determine which node will run the necessary
commands. If the return values are the same regardless of which node I
run attrd_updater on, what do I use to differentiate?
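
For what it is worth, here is how I picture the third option; everything below
is a sketch with made-up names ("project-active", the group placeholder), and
the attrd_updater output format is an assumption that I would have to verify
first:

# cluster side (crm shell): an attribute resource colocated with the workload
primitive project-active ocf:pacemaker:attribute \
        params name=project-active active_value=yes inactive_value=no \
        op monitor interval=10s
colocation col_project-active inf: project-active <your-workload-group>

# cron side (sh): query the local node's value and only act on the active node,
# assuming "attrd_updater -Q -n project-active" prints something like
#   name="project-active" host="thisnode" value="yes"
val=$(attrd_updater -Q -n project-active 2>/dev/null | sed -n 's/.*value="\([^"]*\)".*/\1/p')
if [ "$val" = "yes" ]; then
    # this node currently runs the resource group; do the real work here
    :
fi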

Unfortunately right now I don't have a test cluster that I can play
with things on, only a 'live' one that we had to rush into service
with a bare minimum of testing, so I'm loath to play with things on
it.

Thanks!
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Ulrich Windl
Hi!

(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live-migration 
correctly:

With my test-VM running on node h16, this happened when I tried to move it away 
(for testing):

Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  * Migrate
prm_xen_test-jeos( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating migrate_to 
operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 aborted 
by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 action 
115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note the message above is a duplicate!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource 
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  * Recover
prm_xen_test-jeos( h19 )


So the cluster is doing exactly the wrong thing: The VM is still active on 
h16, while a "recovery" on h19 will start it there! So _after_ the recovery the 
VM is duplicated.

Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating stop 
operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain 
test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos stop 
(call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 
0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Result of stop operation 
for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]:  notice: Initiating start 
operation prm_xen_test-jeos_start_0 locally on h19

Dec 16 09:31:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos start 
(call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 
0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]:  notice: Result of start 
operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020

Amazingly manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...
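
To understand why the scheduler decided on that "recovery", it should be
possible to replay the pe input it saved, along the lines of

h19:~ # crm_simulate -S -x /var/lib/pacemaker/pengine/pe-error-4.bz2

(-S/--simulate and -x/--xml-file as per the crm_simulate man page; the file
name is taken from the earlier log quoted below and may differ for this
transition).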

Regards,
Ulrich Windl


>>> Ulrich Windl schrieb am 14.12.2020 um 15:21 in Nachricht <5FD774CF.8DE : 161 : 60728>:
> Hi!
> 
> I think I found the problem why a VM is started on two nodes:
> 
> Live-Migration had failed (e.g. away from h16), so the cluster uses stop and 
> start (stop on h16, start on h19 for example).
> When rebooting h16, I see these messages (h19 is DC):
> 
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h16 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Resource 
> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
> 
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  notice:  * Restart
> prm_xen_test-jeos( h16 )
> 
> THIS IS WRONG: h16 was booted, so no VM is running on h16 (unless there was 
> some autostart from libvirt; "virsh list --autostart" does not list any)
> 
> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain 
> test-jeos already stopped.
> 
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Calculated 
> transition 669 (with errors), saving inputs in 
> /var/lib/pacemaker/pengine/pe-error-4.bz2
> 
> What's going on here?
> 
> Regards,
> Ulrich
> 
> >>> Ulrich Windl schrieb am 14.12.2020 um 08:15 in Nachricht <5FD7110D.D09 : 161 : 60728>:
> > Hi!
> > 
> > Another word of warning regarding VirtualDomain: While configuring a 3-node 
> 
> > cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA), 
> > I had created a TestVM using BtrFS.
> > At some point during testing, the cluster ended up with the test VM running on 
> > more than one node (for reasons still to be examined). Only after a "crm resource 
> > refresh" (reprobe) did the cluster try to fix the problem.
> > Well at some point the VM wouldn't start any more, because the BtrFS used 
> > for all (SLES default) was corrupted in a way that seems