from:"Madison Kelly"

[ClusterLabs] Detecting pacemaker version incompatibility during node rebuild

2024-06-13 Thread Madison Kelly


Hi all,

  I'm working on a tool to rebuild a node that was lost. Given this 
scenario, upgrading the surviving node is not viable (at least, not 
until after the rebuild is completed and the services can be migrated).


  I ran into a problem where 'pcs cluster start' exits with RC 0, and 
it _looks_ like the cluster is starting, but then it exits without a 
message on STDOUT. In the logs though, I can see this;



Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: notice: Node an-a01n01 state is now member
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: error: Local feature set (3.17.4) is incompatible with DC's (3.19.0)
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: notice: Forcing immediate exit with status 100 (Fatal error occurred, will not respawn)
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: warning: Inhibiting respawn



  So I have two questions;

1. Is there a way to test (using pcs or another tool) to see if the 
local machine is compatible with the peer?


2. If the node being rebuilt isn't compatible, is there a way to tell it 
to start in a compatibility mode, or to tell the surviving peer to 
switch to a compatibility mode? Which one would depend on which node is newer.
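
As a rough illustration of question 1, here is a minimal sketch that only parses the error line shown above and reports which side has the older feature set. The function name `compare_feature_sets` is hypothetical, `sort -V` assumes GNU coreutils, and the sketch does not query pcs or pacemaker in any way; it works purely on the log text, joined onto one line.

```shell
# Hypothetical helper (not part of pcs or pacemaker): read one controld
# error line and say whether the local feature set is older or newer
# than the DC's.
compare_feature_sets() {
    line=$1
    # Pull the two version strings out of the message text.
    local_fs=$(printf '%s\n' "$line" | sed -n 's/.*Local feature set (\([0-9.]*\)).*/\1/p')
    dc_fs=$(printf '%s\n' "$line" | sed -n "s/.*DC's (\([0-9.]*\)).*/\1/p")
    if [ -z "$local_fs" ] || [ -z "$dc_fs" ]; then
        echo "unparsed"
        return 2
    fi
    if [ "$local_fs" = "$dc_fs" ]; then
        echo "same"
        return 0
    fi
    # sort -V orders dotted version strings; the last line is the newest.
    newest=$(printf '%s\n%s\n' "$local_fs" "$dc_fs" | sort -V | tail -n 1)
    if [ "$newest" = "$dc_fs" ]; then
        echo "local-older"
    else
        echo "local-newer"
    fi
}
```

Called with the error line from the log above, this reports that the local node is the older one, which matches the test case described here.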


  Of course, in this particular test case, the node being rebuilt is 
behind the survivor, so the fix here is a simple update of pacemaker 
before rejoining. However in the real world, it's far more likely that 
the node being joined will be a newer version.


  The reason for this is that a large number of our deployments are in 
locations with no or limited internet access. So keeping the active cluster 
regularly updated is not feasible (and some clients "lock" their 
deployments to approved/tested versions).


Thanks for any hints/tips!

Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] When did the CIB change how it reports in_ccm and crmd?

2024-01-20 Thread Madison Kelly

I'm sure it was announced and I missed it, but I just tripped over my 
pants when an update changed 'in_ccm' and 'crmd' in the CIB from 
'true/false' to timestamps...


When did that happen? Is there an announcement marking other changes 
that happened at the same time?


Cheers,

Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404


Re: [ClusterLabs] Planning for Pacemaker 3

2024-01-03 Thread Madison Kelly


On 2024-01-03 12:06, Ken Gaillot wrote:

Hi all,

I'd like to release Pacemaker 3.0.0 around the middle of this year.
I'm gathering proposed changes here:

  https://projects.clusterlabs.org/w/projects/pacemaker/pacemaker_3.0_changes/

Please review for anything that might affect you, and reply here if you
have any concerns.

Pacemaker major-version releases drop support for deprecated features,
to make the code easier to maintain. The biggest planned changes are
dropping support for Upstart and Nagios resources, as well as rolling
upgrades from Pacemaker 1. Much of the lowest-level public C API will
be dropped.

Because the changes will be backward-incompatible, we will continue to
make 2.1 releases for a few years, with backports of compatible fixes,
to help distribution packagers who need to keep backward compatibility.


If this is already a feature, this is going to sound silly...

Would it be possible to trigger scripts if a resource or stonith device 
entered a FAILED state?


--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404

Re: [ClusterLabs] The Linux-HA site is down.

2023-05-03 Thread Madison Kelly


  
  
On 2023-05-03 05:26, 黃暄皓 wrote:

As the title said, is it still in maintenance?

I'm not sure who even owns or maintains that old domain. I don't
think it's been used or maintained for a long time.
-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-12 Thread Madison Kelly

On 2023-01-12 04:50, Keisuke MORI wrote:

Hi,

Just a guess but could it be the same issue with this?

https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background

That was exactly what it was! Bandini linked the same thing last night.
I fixed it by calling 'setsid --wait virsh '.

Thanks!

On Thu, Jan 12, 2023 at 15:36, Madison Kelly wrote:

On 2023-01-12 01:26, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly wrote:

On 2023-01-12 01:12, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly wrote:

Hi all,

There were a lot of sub-threads, so I figured it's helpful to start a
new thread with a summary so far. For context; I have a super simple
perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

I've had variations log environment variables and confirmed that all
the variables in the direct call that work are in the crm_resource
triggered call. There are no selinux issues logged in audit.log and
selinux is permissive. The script logs the real and effective UID and
GID and it's the same in both instances. Calling other shell programs
(tested with 'hostname') runs fine; this is specifically crm_resource ->
test RA -> virsh call.

I ran strace on the virsh call from inside my test script (changing
'virsh.good' to 'virsh.bad' between running directly and via
crm_resource). The strace runs made six files each time. Below are
pastebin links with the outputs of the six runs in one paste, but each
file's output is in its own block (search for file: to see the
different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.

The strace outputs show that your bad runs are all getting stopped
with SIGTTOU. If you've never heard of that, me either.

The hell?! This is new to me also.

https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html

Macro: int SIGTTOU

This is similar to SIGTTIN, but is generated when a process in a
background job attempts to write to the terminal or set its modes.
Again, the default action is to stop the process. SIGTTOU is only
generated for an attempt to write to the terminal if the TOSTOP output
mode is set; see Output Modes.

Maybe this has something to do with the buffer settings in the perl
script(?). It might be worth trying a version that doesn't fiddle with
the outputs and buffer settings.

I tried removing the $|, and then I changed the script to be entirely a
bash script, still hanging. I tried 'virsh --connect <method> list
--all' where <method> was qemu:///system, qemu:///session, and
ssh+qemu:///root@localhost/system; all hang, in bash or perl.

I don't know which difference between your environment and mine is
relevant here, such that I can't reproduce the issue using your test
script. It works perfectly fine for me.

Can you run `stty -a | grep tostop`? If there's a minus sign
("-tostop"), it's disabled; if it's present without a minus sign
("tostop"), it's enabled, as best I can tell.

-tostop is there

[root@mk-a07n02 ~]# stty -a | grep tostop
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
[root@mk-a07n02 ~]#

I'm just spitballing here. It's disabled by default on my machine...
but even when I enable it, crm_resource --validate works fine. It may
be set differently when running under crm_resource.

How do you enable it?

With `stty tostop`

It's 100% possible that this whole thing is a red herring by the way.
I'm looking for anything that might explain the discrepancy. SIGTTOU
may not be directly tied to the root cause.

Appreciate the stab, didn't stop the hang though :(

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't) - SOLVED!

2023-01-12 Thread Madison Kelly

On 2023-01-11 23:10, Madison Kelly wrote:

Hi all,

https://pastebin.com/9z314TaB

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.

bandini found the problem

https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background

/usr/bin/setsid --wait /usr/bin/virsh list --all

That fixed it.
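
For anyone hitting the same hang, a minimal sketch of the workaround as a reusable wrapper. The function name `run_in_new_session` is hypothetical. As I understand the linked answer, running the command in its own session detaches it from the controlling terminal, so it is no longer stopped as a background job; `--wait` makes setsid return the command's own exit status.

```shell
# Hypothetical wrapper around the fix above: run a command in a new
# session and pass its output and exit status back to the caller.
run_in_new_session() {
    # --wait: wait for the program to end and return its exit status.
    setsid --wait "$@"
}
```

In the RA this would be called as `run_in_new_session /usr/bin/virsh list --all`, with the output captured the same way as before.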

omg. I'm going to sleep. holy crap.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-12 Thread Madison Kelly


On 2023-01-12 01:32, Vladislav Bogdanov wrote:
What would be the reason of running that command without redirecting its 
output somewhere?


In the real RA I am. I made a super stripped down test script to figure 
out how to make any call to virsh that didn't end up with it hanging.


--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-11 Thread Madison Kelly


On 2023-01-12 01:28, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 10:27 PM Reid Wahl  wrote:


On Wed, Jan 11, 2023 at 10:26 PM Reid Wahl  wrote:


On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly  wrote:


On 2023-01-12 01:12, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly  wrote:


Hi all,

 There were a lot of sub-threads, so I figured it's helpful to start a
new thread with a summary so far. For context; I have a super simple
perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

 I've had variations log environment variables and confirmed that all
the variables in the direct call that work are in the crm_resource
triggered call. There are no selinux issues logged in audit.log and
selinux is permissive. The script logs the real and effective UID and
GID and it's the same in both instances. Calling other shell programs
(tested with 'hostname') runs fine; this is specifically crm_resource ->
test RA -> virsh call.

 I ran strace on the virsh call from inside my test script (changing
'virsh.good' to 'virsh.bad' between running directly and via
crm_resource). The strace runs made six files each time. Below are
pastebin links with the outputs of the six runs in one paste, but each
file's output is in its own block (search for file: to see the
different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.


The strace outputs show that your bad runs are all getting stopped
with SIGTTOU. If you've never heard of that, me either.


The hell?! This is new to me also.


https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html

Macro: int SIGTTOU

  This is similar to SIGTTIN, but is generated when a process in a
background job attempts to write to the terminal or set its modes.
Again, the default action is to stop the process. SIGTTOU is only
generated for an attempt to write to the terminal if the TOSTOP output
mode is set; see Output Modes.


Maybe this has something to do with the buffer settings in the perl
script(?). It might be worth trying a version that doesn't fiddle with
the outputs and buffer settings.


I tried removing the $|, and then I changed the script to be entirely a
bash script, still hanging. I tried 'virsh --connect <method> list
--all' where <method> was qemu:///system, qemu:///session, and
ssh+qemu:///root@localhost/system; all hang, in bash or perl.


I don't know which difference between your environment and mine is
relevant here, such that I can't reproduce the issue using your test
script. It works perfectly fine for me.

Can you run `stty -a | grep tostop`? If there's a minus sign
("-tostop"), it's disabled; if it's present without a minus sign
("tostop"), it's enabled, as best I can tell.


-tostop is there


[root@mk-a07n02 ~]# stty -a | grep tostop
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
[root@mk-a07n02 ~]#



I'm just spitballing here. It's disabled by default on my machine...
but even when I enable it, crm_resource --validate works fine. It may
be set differently when running under crm_resource.


How do you enable it?


With `stty tostop`


If anything it should be disabled though


I'd be very interested in whether anyone else can reproduce this with
your test script


So would I!!

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-11 Thread Madison Kelly


On 2023-01-12 01:26, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly  wrote:


On 2023-01-12 01:12, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly  wrote:


Hi all,

 There were a lot of sub-threads, so I figured it's helpful to start a
new thread with a summary so far. For context; I have a super simple
perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

 I've had variations log environment variables and confirmed that all
the variables in the direct call that work are in the crm_resource
triggered call. There are no selinux issues logged in audit.log and
selinux is permissive. The script logs the real and effective UID and
GID and it's the same in both instances. Calling other shell programs
(tested with 'hostname') runs fine; this is specifically crm_resource ->
test RA -> virsh call.

 I ran strace on the virsh call from inside my test script (changing
'virsh.good' to 'virsh.bad' between running directly and via
crm_resource). The strace runs made six files each time. Below are
pastebin links with the outputs of the six runs in one paste, but each
file's output is in its own block (search for file: to see the
different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.


The strace outputs show that your bad runs are all getting stopped
with SIGTTOU. If you've never heard of that, me either.


The hell?! This is new to me also.


https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html

Macro: int SIGTTOU

  This is similar to SIGTTIN, but is generated when a process in a
background job attempts to write to the terminal or set its modes.
Again, the default action is to stop the process. SIGTTOU is only
generated for an attempt to write to the terminal if the TOSTOP output
mode is set; see Output Modes.


Maybe this has something to do with the buffer settings in the perl
script(?). It might be worth trying a version that doesn't fiddle with
the outputs and buffer settings.


I tried removing the $|, and then I changed the script to be entirely a
bash script, still hanging. I tried 'virsh --connect <method> list
--all' where <method> was qemu:///system, qemu:///session, and
ssh+qemu:///root@localhost/system; all hang, in bash or perl.


I don't know which difference between your environment and mine is
relevant here, such that I can't reproduce the issue using your test
script. It works perfectly fine for me.

Can you run `stty -a | grep tostop`? If there's a minus sign
("-tostop"), it's disabled; if it's present without a minus sign
("tostop"), it's enabled, as best I can tell.


-tostop is there


[root@mk-a07n02 ~]# stty -a | grep tostop
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
[root@mk-a07n02 ~]#



I'm just spitballing here. It's disabled by default on my machine...
but even when I enable it, crm_resource --validate works fine. It may
be set differently when running under crm_resource.


How do you enable it?


With `stty tostop`

It's 100% possible that this whole thing is a red herring by the way.
I'm looking for anything that might explain the discrepancy. SIGTTOU
may not be directly tied to the root cause.


Appreciate the stab, didn't stop the hang though :(

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-11 Thread Madison Kelly


On 2023-01-12 01:12, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 10:12 PM Reid Wahl  wrote:


On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly  wrote:


Hi all,

There were a lot of sub-threads, so I figured it's helpful to start a
new thread with a summary so far. For context; I have a super simple
perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

I've had variations log environment variables and confirmed that all
the variables in the direct call that work are in the crm_resource
triggered call. There are no selinux issues logged in audit.log and
selinux is permissive. The script logs the real and effective UID and
GID and it's the same in both instances. Calling other shell programs
(tested with 'hostname') runs fine; this is specifically crm_resource ->
test RA -> virsh call.

I ran strace on the virsh call from inside my test script (changing
'virsh.good' to 'virsh.bad' between running directly and via
crm_resource). The strace runs made six files each time. Below are
pastebin links with the outputs of the six runs in one paste, but each
file's output is in its own block (search for file: to see the
different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.


The strace outputs show that your bad runs are all getting stopped
with SIGTTOU. If you've never heard of that, me either.
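
A quick way to check a set of strace output files for this is to list the ones that mention the signal at all. This is only a sketch: the function name `files_with_sigttou` and the idea of keeping each run's files in one directory are assumptions, not how the pastes above were produced.

```shell
# Hypothetical helper: list the strace output files in a directory that
# mention SIGTTOU anywhere (signal delivery or a "stopped by" line).
files_with_sigttou() {
    # $1: directory holding the strace output files for one run
    grep -l 'SIGTTOU' "$1"/* 2>/dev/null
}
```

Running it against the "good" and "bad" directories side by side shows at a glance whether only the crm_resource-triggered run was stopped.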

https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html

Macro: int SIGTTOU

 This is similar to SIGTTIN, but is generated when a process in a
background job attempts to write to the terminal or set its modes.
Again, the default action is to stop the process. SIGTTOU is only
generated for an attempt to write to the terminal if the TOSTOP output
mode is set; see Output Modes.


Maybe this has something to do with the buffer settings in the perl
script(?). It might be worth trying a version that doesn't fiddle with
the outputs and buffer settings.

I don't know which difference between your environment and mine is
relevant here, such that I can't reproduce the issue using your test
script. It works perfectly fine for me.

Can you run `stty -a | grep tostop`? If there's a minus sign
("-tostop"), it's disabled; if it's present without a minus sign
("tostop"), it's enabled, as best I can tell.

I'm just spitballing here. It's disabled by default on my machine...
but even when I enable it, crm_resource --validate works fine. It may
be set differently when running under crm_resource.


I meant to include this:
https://stackoverflow.com/questions/10588334/unix-background-process-stopped-abnormally


If I understand the post;


[root@mk-a07n02 ~]# /usr/bin/nohup perl 
/usr/lib/ocf/resource.d/alteeve/server

/usr/bin/nohup: ignoring input and appending output to 'nohup.out'
[root@mk-a07n02 ~]#


I see the output of the virsh call in the logs fine, no hang.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-11 Thread Madison Kelly


On 2023-01-12 01:12, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly  wrote:


Hi all,

There were a lot of sub-threads, so I figured it's helpful to start a
new thread with a summary so far. For context; I have a super simple
perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

I've had variations log environment variables and confirmed that all
the variables in the direct call that work are in the crm_resource
triggered call. There are no selinux issues logged in audit.log and
selinux is permissive. The script logs the real and effective UID and
GID and it's the same in both instances. Calling other shell programs
(tested with 'hostname') runs fine; this is specifically crm_resource ->
test RA -> virsh call.

I ran strace on the virsh call from inside my test script (changing
'virsh.good' to 'virsh.bad' between running directly and via
crm_resource). The strace runs made six files each time. Below are
pastebin links with the outputs of the six runs in one paste, but each
file's output is in its own block (search for file: to see the
different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.


The strace outputs show that your bad runs are all getting stopped
with SIGTTOU. If you've never heard of that, me either.


The hell?! This is new to me also.


https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html

Macro: int SIGTTOU

 This is similar to SIGTTIN, but is generated when a process in a
background job attempts to write to the terminal or set its modes.
Again, the default action is to stop the process. SIGTTOU is only
generated for an attempt to write to the terminal if the TOSTOP output
mode is set; see Output Modes.


Maybe this has something to do with the buffer settings in the perl
script(?). It might be worth trying a version that doesn't fiddle with
the outputs and buffer settings.


I tried removing the $|, and then I changed the script to be entirely a 
bash script, still hanging. I tried 'virsh --connect <method> list 
--all' where <method> was qemu:///system, qemu:///session, and 
ssh+qemu:///root@localhost/system; all hang, in bash or perl.



I don't know which difference between your environment and mine is
relevant here, such that I can't reproduce the issue using your test
script. It works perfectly fine for me.

Can you run `stty -a | grep tostop`? If there's a minus sign
("-tostop"), it's disabled; if it's present without a minus sign
("tostop"), it's enabled, as best I can tell.


-tostop is there


[root@mk-a07n02 ~]# stty -a | grep tostop
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
[root@mk-a07n02 ~]#
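
To make the same check from inside a script rather than by eye, here is a minimal sketch that classifies captured `stty -a` output. The function name `tostop_state` is hypothetical, and the sketch only inspects text handed to it; it does not call stty itself, since an RA run by crm_resource may not have a terminal to query.

```shell
# Hypothetical helper: given the text printed by `stty -a`, report
# whether the tostop output mode is enabled, disabled, or not listed.
tostop_state() {
    # Flatten newlines and semicolons so every flag is space-delimited.
    flags=" $(printf '%s' "$1" | tr '\n;' '  ') "
    case "$flags" in
        *" -tostop "*) echo "disabled" ;;
        *" tostop "*)  echo "enabled" ;;
        *)             echo "unknown" ;;
    esac
}
```

For the output pasted above, `tostop_state "$(stty -a)"` would report the mode as disabled.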



I'm just spitballing here. It's disabled by default on my machine...
but even when I enable it, crm_resource --validate works fine. It may
be set differently when running under crm_resource.


How do you enable it?

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


[ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-11 Thread Madison Kelly


Hi all,

  There were a lot of sub-threads, so I figured it's helpful to start a 
new thread with a summary so far. For context; I have a super simple 
perl script that pretends to be an RA for the sake of debugging.


https://pastebin.com/9z314TaB

  I've had variations log environment variables and confirmed that all 
the variables in the direct call that work are in the crm_resource 
triggered call. There are no selinux issues logged in audit.log and 
selinux is permissive. The script logs the real and effective UID and 
GID and it's the same in both instances. Calling other shell programs 
(tested with 'hostname') runs fine; this is specifically crm_resource -> 
test RA -> virsh call.


  I ran strace on the virsh call from inside my test script (changing 
'virsh.good' to 'virsh.bad' between running directly and via 
crm_resource). The strace runs made six files each time. Below are 
pastebin links with the outputs of the six runs in one paste, but each 
file's output is in its own block (search for file: to see the 
different file outputs).


Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 14:49, Ken Gaillot wrote:

On Wed, 2023-01-11 at 14:09 -0500, Madison Kelly wrote:

On 2023-01-11 14:01, Madison Kelly wrote:

On 2023-01-11 01:59, Reid Wahl wrote:

On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov wrote:

I suspect that the validate action is run as a non-root user.


As far as I know, both the direct command and crm_resource **should**
be running the agent as the same user, as long as Madison is running
both commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora
36 using the current upstream main of Pacemaker) and it worked fine
both directly and via crm_resource. At the moment I'm not able to dig
very deeply, but I do wonder if it's either a bug that's since been
fixed, or perhaps an environment issue.

To try to rule out the former, do you have a test environment where
you can try to reproduce it on the latest Pacemaker from upstream?


I am running both as the same (root, direct ssh, not sudo'd) user. I run
them back-to-back with consistent results.

I've not built pacemaker in ages. Is there a src.rpm that's likely to
build against centos stream 8 I could try? If not, do you know the
command offhand to create the rpm's from source? If not, I'll grab
the source and read the docs for configure.


Never mind, I've got it building. Will test shortly.


FYI, you can run "make -C rpm rpm" from a source checkout.


[root@mk-a07n02 RPMS]# pacemakerd --version
Pacemaker 2.1.5-1.39e62b78e.git.el8

Built from main just now, same issue. :/

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 16:23, Vladislav Bogdanov wrote:
And, one more thing can affect that - selinux. I doubt, but that's worth 
checking.


selinux is permissive, and nothing is written to audit.log. Side note; I 
checked GID and effective GID as well as UID and EUID, all 0.


I recorded the environment variables and removed the matching ones from 
the lists below. Here are the differences;



-=] Direct
Environment: [_] -> [/usr/lib/ocf/resource.d/alteeve/server]

-=] crm_resource
Environment: [HA_debug] -> [0]
Environment: [HA_logfacility] -> [none]
Environment: [OCF_EXIT_REASON_PREFIX] -> [ocf-exit-reason:]
Environment: [OCF_OUTPUT_FORMAT] -> [xml]
Environment: [OCF_RA_VERSION_MAJOR] -> [1]
Environment: [OCF_RA_VERSION_MINOR] -> [1]
Environment: [OCF_RESKEY_CRM_meta_timeout] -> [2]
Environment: [OCF_RESKEY_crm_feature_set] -> [3.16.2]
Environment: [OCF_RESKEY_name] -> [srv04-test]
Environment: [OCF_RESOURCE_INSTANCE] -> [test]
Environment: [OCF_RESOURCE_PROVIDER] -> [alteeve]
Environment: [OCF_RESOURCE_TYPE] -> [server]
Environment: [OCF_ROOT] -> [/usr/lib/ocf]
Environment: [OCF_TRACE_FILE] -> [/dev/stderr]
Environment: [PCMK_logfacility] -> [none]
Environment: [PCMK_service] -> [crm_resource]
Environment: [_] -> [/usr/sbin/crm_resource]
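
A minimal sketch of how such a comparison can be produced, assuming each run dumps its environment to a file with one variable per line. The function name `env_only_in_second` and the two-file layout are hypothetical and not taken from the actual test script.

```shell
# Hypothetical helper: print the lines that appear in the second
# environment dump but not in the first.
env_only_in_second() {
    # $1: dump from the direct run, $2: dump from the crm_resource run
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    # comm needs both inputs sorted the same way, so pin the locale.
    LC_ALL=C sort "$1" > "$a_sorted"
    LC_ALL=C sort "$2" > "$b_sorted"
    # -1 hides lines unique to the first file, -3 hides common lines.
    LC_ALL=C comm -13 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

Swapping the two arguments gives the variables that exist only in the direct run.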




On January 11, 2023 at 22:21:03, Vladislav Bogdanov wrote:


Then I would suggest to log all env vars and compare them, probably 
something is missing in validate for virsh to be happy.


On January 11, 2023 at 22:06:45, Madison Kelly wrote:


On 2023-01-11 01:13, Vladislav Bogdanov wrote:

I suspect that the validate action is run as a non-root user.


I modified the script to log the real and effective UIDs and it's
running as root in both instances.
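
The same check can be done from a shell version of the test script. This is a sketch only: the function name `log_ids` is hypothetical, and it relies on `id -r` printing the real rather than the effective ID.

```shell
# Hypothetical helper: print real and effective user and group IDs on
# one line so the direct and crm_resource runs can be compared.
log_ids() {
    printf 'uid=%s euid=%s gid=%s egid=%s\n' \
        "$(id -ru)" "$(id -u)" "$(id -rg)" "$(id -g)"
}
```

Appending the output of `log_ids` to the script's log in both runs makes any difference in the IDs easy to spot.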


On January 11, 2023 at 07:06:55, Madison Kelly wrote:


On 2023-01-11 00:21, Madison Kelly wrote:

On 2023-01-11 00:14, Madison Kelly wrote:

Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped on
what's going on.

   The short of it is; if my RA is called from the command line, it's
fine. If a resource exists, monitor, enable, disable, all that stuff
works just fine. If I try to create a resource, it hangs on the
validate stage. Specifically, it hangs when 'pcs' calls:

crm_resource --validate --output-as xml --class ocf --agent server
--provider alteeve --option name=

   Specifically, it hangs when it tries to make a shell call (to
virsh, specifically, but that doesn't matter). So to debug, I started
stripping down my RA simpler and simpler until I was left with the
very most basic of programs;

https://pastebin.com/VtSpkwMr

   That is literally the simplest program I could write that made the
shell call. The 'open()' call is where it hangs.

When I call directly;

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
srv04-test; echo rc:$?


real    0m0.061s
user    0m0.037s
sys    0m0.014s
rc:0


It's just fine. I can see in the log the output from the 'virsh' call
as well. However, when I call from crm_resource;

time crm_resource --validate --output-as xml --class ocf --agent
server --provider alteeve --option name=srv04-test; echo rc:$?



name=srv04-test">
   crm_resource: Error performing operation: Error occurred


real    0m20.521s
user    0m0.022s
sys    0m0.010s
rc:1


In the log file, I see (from line 20 of the 
super-simple-test-script):



Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
/usr/bin/echo return_code:0 |]


Then nothing else.

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/


After sending this, I tried having my "RA" call 'hostname', and that
worked fine. I switched back to 'virsh list --all', and that hangs. So
it seems to somehow be related to calling 'virsh' specifically.



OK, so more info... Knowing now that it's a problem with the virsh call
specifically (but only when validating; existing VMs monitor, enable, and
disable fine, all of which repeatedly call virsh), I noticed a few things.

First, I see in the logs:


Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
Connection reset by peer


So with this, I further simplified my test script to this:

https://pastebin.com/Ey8FdL1t

Then when I ran my test script directly, the strace output is:

Good: https://pastebin.com/Trbq67ub

When my script is called via crm_resource, the strace is this:

Bad: https://pastebin.com/jtbzHrUM

The first difference I can see happens around line 929 in the good
paste: the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
exists there, but is absent from the bad paste. Shortly after, I start seeing:


line: [write(4, "\1\0\0\0\0\0\0\0", 8)

Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 15:55, Reid Wahl wrote:

On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly  wrote:


On 2023-01-11 01:59, Reid Wahl wrote:

On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
 wrote:


I suspect that the validate action is run as a non-root user.


As far as I know, both the direct command and crm_resource **should**
be running the agent as the same user, as long as Madison is running
both commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora
36 using the current upstream main of Pacemaker) and it worked fine
both directly and via crm_resource. At the moment I'm not able to dig
very deeply, but I do wonder if it's either a bug that's since been
fixed, or perhaps an environment issue.
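
One way an environment difference could be checked is to have the agent dump its environment in each invocation and diff the results. This is only a sketch: snapshot_env, compare_env, the file names, and the DEMO_ONLY variable are all hypothetical, and the two snapshots below are simulated in one shell; they do not come from a real RA run.

```shell
#!/usr/bin/env bash
# Write the current environment, sorted, to the given file.
# In practice this would be called from inside the agent.
snapshot_env() {
    env | sort > "$1"
}

# Print the variables that differ between two snapshots.
# Always returns 0, even when the files differ.
compare_env() {
    diff "$1" "$2" || true
}

tmpdir=$(mktemp -d)
# Simulated "direct" call: one extra variable is exported in a subshell.
( export DEMO_ONLY=1; snapshot_env "$tmpdir/direct.env" )
# Simulated "crm_resource" call: the variable is absent.
snapshot_env "$tmpdir/crm.env"
compare_env "$tmpdir/direct.env" "$tmpdir/crm.env"
```

Any line prefixed with '<' or '>' in the diff is a variable present in only one of the two invocations, or set to a different value.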

To try to rule out the former, do you have a test environment where
you can try to reproduce it on the latest Pacemaker from upstream?


I built the pacemaker source RPM from Fedora 37, then realized I'm
already running 2.1.5 on CS8, so I'm already on the latest release.
Looking at git, 2.1.5 is the latest tagged release... Are you running
newer than that?


I'm running on the current main, which contains commits that came
after the 2.1.5 release. I don't really expect this to be a Pacemaker
bug, especially with how recent your version is, but I would like to
rule that out if possible.


Would you happen to have either the src.rpm or the ./configure options you
used offhand?




--
Madison Kelly
Alteeve's Ni

Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 01:13, Vladislav Bogdanov wrote:

I suspect that the validate action is run as a non-root user.


I modified the script to log the real and effective UIDs and it's 
running as root in both instances.
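
For reference, a check of this kind could look like the sketch below. It is written in shell for illustration and is not the code from the actual test script; log_ids is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the real and effective user IDs of the calling process.
# 'log_ids' is a hypothetical helper; the output would normally be
# appended to the agent's own log file.
log_ids() {
    echo "ruid:$(id -ru) euid:$(id -u)"
}

log_ids
```

Both values being 0 in both invocations means the agent runs as root either way.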







--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 01:59, Reid Wahl wrote:

On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
 wrote:


I suspect that the validate action is run as a non-root user.


As far as I know, both the direct command and crm_resource **should**
be running the agent as the same user, as long as Madison is running
both commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora
36 using the current upstream main of Pacemaker) and it worked fine
both directly and via crm_resource. At the moment I'm not able to dig
very deeply, but I do wonder if it's either a bug that's since been
fixed, or perhaps an environment issue.

To try to rule out the former, do you have a test environment where
you can try to reproduce it on the latest Pacemaker from upstream?


I built the pacemaker source RPM from Fedora 37, then realized I'm 
already running 2.1.5 on CS8, so I'm already on the latest release. 
Looking at git, 2.1.5 is the latest tagged release... Are you running 
newer than that?




Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 14:01, Madison Kelly wrote:

On 2023-01-11 01:59, Reid Wahl wrote:

On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
 wrote:


I suspect that the validate action is run as a non-root user.


As far as I know, both the direct command and crm_resource **should**
be running the agent as the same user, as long as Madison is running
both commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora
36 using the current upstream main of Pacemaker) and it worked fine
both directly and via crm_resource. At the moment I'm not able to dig
very deeply, but I do wonder if it's either a bug that's since been
fixed, or perhaps an environment issue.

To try to rule out the former, do you have a test environment where
you can try to reproduce it on the latest Pacemaker from upstream?


I am running both as the same (root, direct ssh, not sudo'd) user. I run 
them back-to-back with consistent results.
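
With the user being the same, another thing that could differ between the two invocations is the set of blocked and ignored signals inherited from the parent process. The sketch below is Linux-only and purely an assumption about what might be worth comparing; nothing in this thread confirms it is the cause. show_signal_masks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Linux-only: print the blocked (SigBlk) and ignored (SigIgn) signal masks
# of this shell, as recorded in /proc. Both are inherited across exec, so
# they can differ depending on which parent started the agent.
show_signal_masks() {
    grep -E '^Sig(Blk|Ign):' "/proc/$$/status"
}

show_signal_masks
```

Logging this from the agent in both invocations and comparing the two hex masks would show whether the parent process changes the signal setup.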


I've not built pacemaker in ages. Is there a src.rpm that's likely to
build against CentOS Stream 8 that I could try? If not, do you know the
command offhand to create the RPMs from source? If not, I'll grab
the source and read the docs for configure.


Never mind, I've got it building. Will test shortly.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 01:59, Reid Wahl wrote:

On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
 wrote:


I suspect that the validate action is run as a non-root user.


As far as I know, both the direct command and crm_resource **should**
be running the agent as the same user, as long as Madison is running
both commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora
36 using the current upstream main of Pacemaker) and it worked fine
both directly and via crm_resource. At the moment I'm not able to dig
very deeply, but I do wonder if it's either a bug that's since been
fixed, or perhaps an environment issue.

To try to rule out the former, do you have a test environment where
you can try to reproduce it on the latest Pacemaker from upstream?


I am running both as the same (root, direct ssh, not sudo'd) user. I run 
them back-to-back with consistent results.


I've not built pacemaker in ages. Is there a src.rpm that's likely to
build against CentOS Stream 8 that I could try? If not, do you know the
command offhand to create the RPMs from source? If not, I'll grab
the source and read the docs for configure.




Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Madison Kelly


On 2023-01-11 07:46, Bob Peterson wrote:

On 1/11/23 1:06 AM, Madison Kelly wrote:

On 2023-01-11 00:21, Madison Kelly wrote:

On 2023-01-11 00:14, Madison Kelly wrote:

Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped 
on what's going on.


   The short of it is; if my RA is called from the command line, 
it's fine. If a resource exists, monitor, enable, disable, all that 
stuff works just fine. If I try to create a resource, it hangs on 
the validate stage. Specifically, it hangs when 'pcs' calls:


crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=


   Specifically, it hangs when it tries to make a shell call (to 
virsh, specifically, but that doesn't matter). So to debug, I 
started stripping down my RA simpler and simpler until I was left 
with the very most basic of programs;


https://pastebin.com/VtSpkwMr


In the failing case do you get any interesting messages on the console 
or in dmesg?


Bob Peterson


Nope, nothing in dmesg. At the console, I see:


[root@mk-a07n02 ~]# crm_resource --validate --output-as xml --class ocf 
--agent server --provider alteeve --option name=srv04-test



[XML output; most of the markup was lost in the archive. Surviving fragments:]

  provider="alteeve">
  execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
  crm_resource: Error performing operation: Error occurred



--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Madison Kelly




On January 11, 2023 2:26:57 a.m. EST, Ulrich Windl 
 wrote:
>>>> Madison Kelly wrote on 11.01.2023 at 06:21 in
>>>> message
><74df2c8e-1cff-ba07-7f4a-070be296b...@alteeve.com>:
>> On 2023-01-11 00:14, Madison Kelly wrote:
>>> Hi all,
>>> 
>>> Edit: Last message was in HTML format, sorry about that.
>>> 
>>>I've got a hell of a weird problem, and I am absolutely stumped on 
>>> what's going on.
>>> 
>>>The short of it is; if my RA is called from the command line, it's 
>>> fine. If a resource exists, monitor, enable, disable, all that stuff 
>>> works just fine. If I try to create a resource, it hangs on the validate 
>>> stage. Specifically, it hangs when 'pcs' calls:
>>> 
>>> crm_resource --validate --output-as xml --class ocf --agent server 
>>> --provider alteeve --option name=
>>> 
>>>Specifically, it hangs when it tries to make a shell call (to virsh, 
>>> specifically, but that doesn't matter). So to debug, I started stripping 
>>> down my RA simpler and simpler until I was left with the very most basic 
>>> of programs;
>>> 
>>> https://pastebin.com/VtSpkwMr 
>>> 
>>>That is literally the simplest program I could write that made the 
>>> shell call. The 'open()' call is where it hangs.
>>> 
>>> When I call directly;
>>> 
>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
>>> srv04-test; echo rc:$?
>>> 
>>> 
>>> real    0m0.061s
>>> user    0m0.037s
>>> sys     0m0.014s
>>> rc:0
>>> 
>>> 
>>> It's just fine. I can see in the log the output from the 'virsh' call as 
>>> well. However, when I call from crm_resource;
>>> 
>>> time crm_resource --validate --output-as xml --class ocf --agent server 
>>> --provider alteeve --option name=srv04-test; echo rc:$?
>>> 
>>> 
>>> 
>>> [XML output; most of the markup was lost in the archive. Surviving fragments:]
>>>
>>>   provider="alteeve">
>>>   execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
>>>   crm_resource: Error performing operation: Error occurred
>>> 
>>> 
>>> real    0m20.521s
>>> user    0m0.022s
>>> sys     0m0.010s
>>> rc:1
>>> 
>>> 
>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>> 
>>> 
>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
>>> /usr/bin/echo return_code:0 |]
>>> 
>
>In VirtualDomain RA I found a similar command (assuming that works):
> virsh $VIRSH_OPTIONS dumpxml --inactive --security-info ${DOMAIN_NAME} >
> ${CFGTMP}
>
>virsh is somewhat strange; libvirtd is running, right?

Yes, I can call the RA directly, then immediately call crm_resource, or reverse 
order, always the same results.

Again, same calls work fine when enabling, disabling, etc. So weird...

>>> 
>>> Then nothing else.
>>> 
>>> The strace output is: https://pastebin.com/raw/UCEUdBeP 
>>> 
>>> Environment;
>>> 
>>> * selinux is permissive
>>> * Pacemaker 2.1.5-4.el8
>>> * pcs 0.10.15
>>> * 4.18.0-408.el8.x86_64
>>> * CentOS Stream release 8
>>> 
>>> Any help is appreciated, I am stumped. :/
>> 
>> After sending this, I tried having my "RA" call 'hostname', and that 
>> worked fine. I switched back to 'virsh list --all', and that hangs. So 
>> it seems to somehow be related to call 'virsh' specifically.
>> 
>> -- 
>> Madison Kelly
>> Alteeve's Niche!
>> Chief Technical Officer
>> c: +1-647-471-0951
>> https://alteeve.com/ 
>> 

Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Madison Kelly


On 2023-01-11 00:21, Madison Kelly wrote:

On 2023-01-11 00:14, Madison Kelly wrote:

Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped on 
what's going on.


   The short of it is; if my RA is called from the command line, it's 
fine. If a resource exists, monitor, enable, disable, all that stuff 
works just fine. If I try to create a resource, it hangs on the 
validate stage. Specifically, it hangs when 'pcs' calls:


crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=


   Specifically, it hangs when it tries to make a shell call (to 
virsh, specifically, but that doesn't matter). So to debug, I started 
stripping down my RA simpler and simpler until I was left with the 
very most basic of programs;


https://pastebin.com/VtSpkwMr

   That is literally the simplest program I could write that made the 
shell call. The 'open()' call is where it hangs.


When I call directly;

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
srv04-test; echo rc:$?



real    0m0.061s
user    0m0.037s
sys    0m0.014s
rc:0


It's just fine. I can see in the log the output from the 'virsh' call 
as well. However, when I call from crm_resource;


time crm_resource --validate --output-as xml --class ocf --agent 
server --provider alteeve --option name=srv04-test; echo rc:$?




[XML output; most of the markup was lost in the archive. Surviving fragments:]

  provider="alteeve">
  execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
  crm_resource: Error performing operation: Error occurred


real    0m20.521s
user    0m0.022s
sys    0m0.010s
rc:1


In the log file, I see (from line 20 of the super-simple-test-script):


Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
/usr/bin/echo return_code:0 |]



Then nothing else.

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/


After sending this, I tried having my "RA" call 'hostname', and that 
worked fine. I switched back to 'virsh list --all', and that hangs. So 
it seems to somehow be related to call 'virsh' specifically.




OK, so more info... Knowing now that it's a problem with the virsh call
specifically (but only when validating; existing VMs monitor, enable, and
disable fine, all of which repeatedly call virsh), I noticed a few things.


First, I see in the logs:


Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data: 
Connection reset by peer



So with this, I further simplified my test script to this:

https://pastebin.com/Ey8FdL1t

Then when I ran my test script directly, the strace output is:

Good: https://pastebin.com/Trbq67ub

When my script is called via crm_resource, the strace is this:

Bad: https://pastebin.com/jtbzHrUM

The first difference I can see happens around line 929 in the good
paste: the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
exists there, but is absent from the bad paste. Shortly after, I start seeing:



line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
line: [brk(NULL)   = 0x562b7877d000]
line: [brk(0x562b787aa000) = 0x562b787aa000]
line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]


Around line 959 in the bad paste. There are more brk() lines, and not 
long after the output stops.
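
Comparing two strace logs by eye is easier if values that change on every run are masked first. The sketch below is one possible approach, not what was actually used here; normalize_strace is a hypothetical helper name, and the sample lines are adapted from the traces quoted above.

```shell
#!/usr/bin/env bash
# Mask values that vary between runs (hex addresses, and a leading PID as
# written by 'strace -f') so two logs can be compared with diff.
normalize_strace() {
    sed -E -e 's/0x[0-9a-f]+/0xADDR/g' -e 's/^[0-9]+ +//'
}

printf '%s\n' \
    'futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0' \
    'brk(0x562b787aa000) = 0x562b787aa000' | normalize_strace
# prints:
#   futex(0xADDR, FUTEX_WAKE_PRIVATE, 1) = 0
#   brk(0xADDR) = 0xADDR
```

With both logs saved locally (file names here are placeholders), something like 'diff <(normalize_strace < good.txt) <(normalize_strace < bad.txt)' would then show only the structural differences between the two runs.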


--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Madison Kelly


On 2023-01-11 00:14, Madison Kelly wrote:

Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped on 
what's going on.


   The short of it is; if my RA is called from the command line, it's 
fine. If a resource exists, monitor, enable, disable, all that stuff 
works just fine. If I try to create a resource, it hangs on the validate 
stage. Specifically, it hangs when 'pcs' calls:


crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=


   Specifically, it hangs when it tries to make a shell call (to virsh, 
specifically, but that doesn't matter). So to debug, I started stripping 
down my RA simpler and simpler until I was left with the very most basic 
of programs;


https://pastebin.com/VtSpkwMr

   That is literally the simplest program I could write that made the 
shell call. The 'open()' call is where it hangs.


When I call directly;

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
srv04-test; echo rc:$?



real    0m0.061s
user    0m0.037s
sys    0m0.014s
rc:0


It's just fine. I can see in the log the output from the 'virsh' call as 
well. However, when I call from crm_resource;


time crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=srv04-test; echo rc:$?




[XML output; most tags were lost in the archive, surviving fragments follow]
   ... provider="alteeve">
   ... execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
   crm_resource: Error performing operation: Error occurred

     
   


real    0m20.521s
user    0m0.022s
sys    0m0.010s
rc:1


In the log file, I see (from line 20 of the super-simple-test-script):


Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
/usr/bin/echo return_code:0 |]



Then nothing else.

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/


After sending this, I tried having my "RA" call 'hostname', and that 
worked fine. I switched back to 'virsh list --all', and that hangs. So 
it seems to somehow be related to call 'virsh' specifically.
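
One way to narrow down a hang like this is to run the external command with its own, shorter timeout and with stdin detached, so the agent can log a clear failure instead of blocking until crm_resource gives up. The sketch below is in Python rather than the Perl of the test script, and it is only a debugging aid under stated assumptions: the function name run_with_timeout, the timeout values, and the use of 124 as the "timed out" code (borrowed from timeout(1)) are illustrative choices. Nothing here identifies why 'virsh' hangs.

```python
import subprocess
import sys
from typing import List, Tuple

TIMED_OUT = 124  # illustrative choice; the same value timeout(1) reports


def run_with_timeout(command: List[str], timeout_s: float) -> Tuple[int, str]:
    """Run a command without a shell and return (return code, combined output)."""
    try:
        done = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # the child does not inherit the caller's stdin
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr, like '2>&1' in the logged call
            text=True,
            timeout=timeout_s,          # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return TIMED_OUT, f"no exit within {timeout_s}s: {' '.join(command)}"
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; in an agent the list
    # would be something like ["/usr/bin/virsh", "list", "--all"].
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], 5.0))
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

Calling the wrapper once from the command line and once via crm_resource, and comparing the returned code and output, would at least show whether the child starts and stalls or never produces output at all.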


--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


[ClusterLabs] RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Madison Kelly


Hi all,

Edit: Last message was in HTML format, sorry about that.

  I've got a hell of a weird problem, and I am absolutely stumped on 
what's going on.


  The short of it is; if my RA is called from the command line, it's 
fine. If a resource exists, monitor, enable, disable, all that stuff 
works just fine. If I try to create a resource, it hangs on the validate 
stage. Specifically, it hangs when 'pcs' calls:


crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=


  More precisely, it hangs when it tries to make a shell call (to virsh, 
but that doesn't matter). So to debug, I kept stripping my RA down until 
I was left with the most basic of programs:


https://pastebin.com/VtSpkwMr

  That is literally the simplest program I could write that made the 
shell call. The 'open()' call is where it hangs.


When I call directly;

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
srv04-test; echo rc:$?



real    0m0.061s
user    0m0.037s
sys     0m0.014s
rc:0


It's just fine. I can see in the log the output from the 'virsh' call as 
well. However, when I call from crm_resource;


time crm_resource --validate --output-as xml --class ocf --agent server 
--provider alteeve --option name=srv04-test; echo rc:$?




[XML output; most tags were lost in the archive, surviving fragments follow]
  ... provider="alteeve">
  ... execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
  crm_resource: Error performing operation: Error occurred


  


real    0m20.521s
user    0m0.022s
sys     0m0.010s
rc:1


In the log file, I see (from line 20 of the super-simple-test-script):


Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
/usr/bin/echo return_code:0 |]



Then nothing else.

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/
--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

[ClusterLabs] RA hangs when called by crm_resource

2023-01-10 Thread Madison Kelly


  
  
Hi all,
  I've got a hell of a weird problem, and I am absolutely stumped
  on what's going on. 
  The short of it is; if my RA is called from the command line,
  it's fine. If a resource exists, monitor, enable, disable, all
  that stuff works just fine. If I try to create a resource, it
  hangs on the validate stage. Specifically, it hangs when 'pcs'
  calls: 

crm_resource --validate --output-as xml --class ocf --agent
  server --provider alteeve --option name=
  More precisely, it hangs when it tries to make a shell call (to
  virsh, but that doesn't matter). So to debug, I kept stripping my
  RA down until I was left with the most basic of programs:
https://pastebin.com/VtSpkwMr
  That is literally the simplest program I could write that made
  the shell call. The 'open()' call is where it hangs. 

When I call directly;
time /usr/lib/ocf/resource.d/alteeve/server --validate-all
  --server srv04-test; echo rc:$?
  
  
  real    0m0.061s
  user    0m0.037s
  sys    0m0.014s
  rc:0
  
It's just fine. I can see in the log the output from the 'virsh'
  call as well. However, when I call from crm_resource;
time crm_resource --validate --output-as xml --class ocf --agent
  server --provider alteeve --option name=srv04-test; echo rc:$?


  
    
      
      
    
    
      
    crm_resource: Error performing operation: Error occurred
      
    
  
  
  real    0m20.521s
  user    0m0.022s
  sys    0m0.010s
  rc:1
  
In the log file, I see (from line 20 of the
  super-simple-test-script):

  Calling: [/usr/bin/virsh dumpxml --inactive srv04-test
  2>&1; /usr/bin/echo return_code:0 |]
  

Then nothing else. 

The strace output is: https://pastebin.com/raw/UCEUdBeP
Environment;

* selinux is permissive
  * Pacemaker 2.1.5-4.el8
  * pcs 0.10.15
  * 4.18.0-408.el8.x86_64
  * CentOS Stream release 8

Any help is appreciated, I am stumped. :/

    -- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] Preventing a resource from migrating to / starting on a node

2022-11-29 Thread Madison Kelly


  
  
On 2022-11-29 00:31, Reid Wahl wrote:


  On Mon, Nov 28, 2022 at 8:21 PM Madison Kelly wrote:

  

This question builds on questions I was discussing with kgaillot on IRC.

I am trying to prevent a resource from being allowed to migrate to or start on a given node. When I asked about this, Ken talked about node attributes, which I've been trying to implement.

To figure this out and test it, I set up a node attribute called 'drbd-fenced_srv01-sql' for the resource 'srv01-sql', with a rule that applies a location constraint of -INFINITY. I had the resource running on 'mk-a01n01' and then set 'drbd-fenced_srv01-sql=1' on 'mk-a01n02' to trigger the constraint against that node. I verified this was set, then tried migrating the resource, and it happily migrated.

Clearly I am missing something. :)


[root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
scope=nodes  name=drbd-fenced_srv01-sql value=1

[root@mk-a01n01 ~]# pcs constraint location config
Location Constraints:
  Resource: srv01-sql
    Enabled on:
      Node: mk-a01n02 (score:100)
      Node: mk-a01n01 (score:200)
    Constraint: location-srv01-sql
      Rule: score=-INFINITY
        Expression: drbd-fenced_srv01-sql eq 0
  Resource: srv02-web
    Enabled on:
      Node: mk-a01n02 (score:100)
      Node: mk-a01n01 (score:200)

[root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
scope=nodes  name=drbd-fenced_srv01-sql value=1

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql    (ocf::alteeve:server):     Started mk-a01n01

[root@mk-a01n01 ~]# pcs constraint location srv01-sql prefers mk-a01n02=200 mk-a01n01=100

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql    (ocf::alteeve:server):     Migrating mk-a01n01

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql    (ocf::alteeve:server):     Started mk-a01n02


I feel like this shouldn't be so complicated, so I am likely over-thinking this, or missing something obvious...

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

  
  
The configured rule prevents srv01-sql from running on a node where
the drbd-fenced_srv01-sql attribute is set to 0. It looks like it's
set to 1.

Maybe I'm misunderstanding though -- if I am, can you help clarify and
send the CIB so that I can mess around with it?


Excuse me one second...
"AARG!!"
OK, now I am better. Thank you, that was the problem. :)
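
For anyone hitting the same thing, the mismatch can be shown with a few lines of Python. This is a deliberately simplified model, not Pacemaker's rule engine: it handles only a single 'eq' expression, compares values as strings, and assumes that a node without the attribute does not match. The function names are made up for illustration; the attribute name and values are the ones from the output above.

```python
from typing import Dict


def expression_matches(node_attrs: Dict[str, str], attribute: str, value: str) -> bool:
    """Simplified 'attribute eq value' test, compared as strings.

    Assumption: a node that lacks the attribute does not match.
    """
    return node_attrs.get(attribute) == value


def node_is_banned(node_attrs: Dict[str, str], attribute: str, value: str) -> bool:
    """A rule with score=-INFINITY bans the node only when its expression matches."""
    return expression_matches(node_attrs, attribute, value)


# Value reported by the crm_attribute query for mk-a01n02.
attrs_n02 = {"drbd-fenced_srv01-sql": "1"}

# Rule as configured ("eq 0"): no match, so the node is still allowed.
print(node_is_banned(attrs_n02, "drbd-fenced_srv01-sql", "0"))

# Rule as intended ("eq 1"): match, so -INFINITY would apply to the node.
print(node_is_banned(attrs_n02, "drbd-fenced_srv01-sql", "1"))
```

The first call prints False and the second prints True, which is the whole problem in miniature: the rule tested for 0 while the attribute was set to 1.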
    
-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] Antw: [EXT] Preventing a resource from migrating to / starting on a node

2022-11-29 Thread Madison Kelly


  
  
I was taking Ken's advice. Originally
  my plan was to use location constraints, but I assume Ken's
  reasoning was sound for the node attribute approach.



On 2022-11-29 02:51, Ulrich Windl wrote:


  Why can't you use a plain location constraint?


  

  
Madison Kelly wrote on 29.11.2022 at 05:21 in message

  

  


-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


[ClusterLabs] Preventing a resource from migrating to / starting on a node

2022-11-28 Thread Madison Kelly


  
  
This question builds on questions I was discussing with kgaillot on
  IRC.
I am trying to prevent a resource from being allowed to migrate to
  or start on a given node. When I asked about this, Ken talked
  about node attributes, which I've been trying to implement. 

To figure this out and test it, I set up a node attribute called
  'drbd-fenced_srv01-sql' for the resource 'srv01-sql', with a rule
  that applies a location constraint of -INFINITY. I had the
  resource running on 'mk-a01n01' and then set
  'drbd-fenced_srv01-sql=1' on 'mk-a01n02' to trigger the constraint
  against that node. I verified this was set, then tried migrating
  the resource, and it happily migrated.
Clearly I am missing something. :)


  [root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02
  --name drbd-fenced_srv01-sql --query
  scope=nodes  name=drbd-fenced_srv01-sql value=1
  
  [root@mk-a01n01 ~]# pcs constraint location config
  Location Constraints:
    Resource: srv01-sql
      Enabled on:
        Node: mk-a01n02 (score:100)
        Node: mk-a01n01 (score:200)
      Constraint: location-srv01-sql
        Rule: score=-INFINITY
          Expression: drbd-fenced_srv01-sql eq 0
    Resource: srv02-web
      Enabled on:
        Node: mk-a01n02 (score:100)
        Node: mk-a01n01 (score:200)
  
  [root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02
  --name drbd-fenced_srv01-sql --query
  scope=nodes  name=drbd-fenced_srv01-sql value=1
  
  [root@mk-a01n01 ~]# pcs resource status srv01-sql
    * srv01-sql    (ocf::alteeve:server):     Started mk-a01n01
  
  [root@mk-a01n01 ~]# pcs constraint location srv01-sql prefers
  mk-a01n02=200 mk-a01n01=100
  
  [root@mk-a01n01 ~]# pcs resource status srv01-sql
    * srv01-sql    (ocf::alteeve:server):     Migrating mk-a01n01
  
  [root@mk-a01n01 ~]# pcs resource status srv01-sql
    * srv01-sql    (ocf::alteeve:server):     Started mk-a01n02
  
I feel like this shouldn't be so complicated, so I am likely
  over-thinking this, or missing something obvious... 

-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


[ClusterLabs] HA Summit 2023

2022-11-23 Thread Madison Kelly


  
  
Understanding that it's all but impossible to predict how covid
  will go, I think it's as good a time as any to start planning for
  the next HA Summit. 

Last time was in Brno hosted by Red Hat. So I suppose we can prod
  SUSE to host this time? SUSE folks, how does that sound? 

I'm thinking summer or fall of '23. 

Basically, consider this a "starting the ball rolling" and that's
  it. What makes sense to people? How comfortable would people be
  with restarting the HA Summits in person again? Any preference for
  location, timing, etc?
Madi
    
-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] DRBD and SQL Server

2022-09-28 Thread Madison Kelly


  
  
On 2022-09-28 04:36, Jehan-Guillaume de
  Rorthais wrote:


  On Wed, 28 Sep 2022 02:33:59 -0400
Madison Kelly  wrote:


  
...
I'm happy to go into more detail, but I'll stop here until/unless you have
more questions. Otherwise I'd write a book. :)

  
  
I would buy it ;)


Haha! Feel free to email me directly with specific questions if
  you'd like, and I'll go into detail. That avoids flooding the list. :)
Madi

-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] DRBD and SQL Server

2022-09-28 Thread Madison Kelly


  
  
The way we do it is to create a VM that
  runs Windows and hosts the DB, so neither the OS nor the DB has to
  have any concept that there's replication behind the scenes.


Exactly how we do it is specific to our
  platform (The Anvil), but in short we've got a custom RA for
  Pacemaker and management tools that create LVs per VM, and those
  LVs become backing devices for a DRBD resource with one or more
  volumes. The resource runs in single-primary mode except when we
  want to live migrate (all of this is handled in our Pacemaker RA);
  then we enable dual-primary, promote the target to Primary,
  migrate, demote the old host to Secondary, and disable
  dual-primary support. 
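
The ordering of those migration steps can be sketched with a toy model. This is plain Python and not a DRBD or Pacemaker API: the class DrbdResourceModel, the function live_migrate, and the node names are all hypothetical, and the only rule it encodes is that a second Primary is allowed solely while dual-primary is enabled.

```python
from typing import Dict, List


class DrbdResourceModel:
    """Toy model of one DRBD resource on two nodes (not a real DRBD API)."""

    def __init__(self, primary: str, secondary: str) -> None:
        self.roles: Dict[str, str] = {primary: "Primary", secondary: "Secondary"}
        self.dual_primary = False

    def primaries(self) -> List[str]:
        return [node for node, role in self.roles.items() if role == "Primary"]

    def set_dual_primary(self, enabled: bool) -> None:
        # Dual-primary may only be switched off once a single Primary remains.
        if not enabled and len(self.primaries()) > 1:
            raise RuntimeError("demote one node before disabling dual-primary")
        self.dual_primary = enabled

    def promote(self, node: str) -> None:
        others = [n for n in self.primaries() if n != node]
        if others and not self.dual_primary:
            raise RuntimeError("a second Primary requires dual-primary")
        self.roles[node] = "Primary"

    def demote(self, node: str) -> None:
        self.roles[node] = "Secondary"


def live_migrate(res: DrbdResourceModel, source: str, target: str) -> None:
    """Apply the steps in the order described above."""
    res.set_dual_primary(True)
    res.promote(target)
    # The VM live migration from source to target would run here.
    res.demote(source)
    res.set_dual_primary(False)


res = DrbdResourceModel(primary="node1", secondary="node2")
live_migrate(res, "node1", "node2")
print(res.roles, res.dual_primary)
```

After the call the model ends up with node2 as the only Primary and dual-primary disabled again, so the window in which both nodes are Primary is limited to the migration itself.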



Of course, protection is provided via
  IPMI fencing as the primary method with switched PDU fencing as a
  backup. 



I'm happy to go into more detail, but
  I'll stop here until/unless you have more questions. Otherwise I'd
  write a book. :)


Madi



On 2022-09-27 15:42, Eric Robinson
  wrote:


  
  
  
  
Hi Madi,
 
It sounds like you’ve had a lot of good
  experience. I’m trying to decide between paying a premium
  price for MSSQL Enterprise with Always-On Replication or just
  setting up an Active/Standby scenario with the Standard
  Edition of MSSQL running on DRBD. We have tons of experience
  with MySQL on DRBD, but not with MSSQL. When running MSSQL on
  DRBD, what’s the cluster stack? How does failover work? When
  using MySQL, the service only runs on one server at a time. In
  a failover, the writable data volume transitions to the
  standby server and then the MySQL service is started on it.
  Does it work the same way with MSSQL?
 
-Eric
 
 

  

  From: Madison Kelly
  Sent: Monday, September 26, 2022 7:55 PM
  To: Cluster Labs - All topics related to open-source clustering welcomed; Eric Robinson
  Subject: Re: [ClusterLabs] DRBD and SQL Server

  
   
  
On 2022-09-25 23:49, Eric Robinson
  wrote:
  
  
Hey list, 
 
Anybody have experience running SQL
  Server on DRBD? I’d ask this in the DRBD list but that one
  is like a ghost town. This list is the next best option.
  
 
-Eric
  
  Extensively, yes. Albeit in VMs whose storage was backed by
DRBD, though for all practical purposes there's no real
difference. We've had clients running various DB servers for
over ten years spanning DRBD 8.3 through to the latest 9.1.
  What's your question?
  Madi
  -- 
  Madison Kelly
  Alteeve's Niche!
  Chief Technical Officer
  c: +1-647-471-0951
  https://alteeve.com/

  



-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  


Re: [ClusterLabs] DRBD and SQL Server

2022-09-26 Thread Madison Kelly


  
  
On 2022-09-25 23:49, Eric Robinson
  wrote:


  
  
  
  
Hey list, 
 
Anybody have experience running SQL Server
  on DRBD? I’d ask this in the DRBD list but that one is like a
  ghost town. This list is the next best option.
  
 
-Eric
  

Extensively, yes. Albeit in VMs whose storage was backed by DRBD,
  though for all practical purposes there's no real difference.
  We've had clients running various DB servers for over ten years
  spanning DRBD 8.3 through to the latest 9.1.
What's your question?

Madi

-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
  

