[ClusterLabs] Detecting pacemaker version incompatibility during node rebuild
Hi all,

I'm working on a tool to rebuild a node that was lost. Given this scenario, upgrading the surviving node is not viable (at least, not until after the rebuild is completed and the services can be migrated).

I ran into a problem where 'pcs cluster start' exits with RC 0, and it _looks_ like the cluster is starting, but then it exits without a message on STDOUT. In the logs, though, I can see this:

Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: notice: Node an-a01n01 state is now member
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: error: Local feature set (3.17.4) is incompatible with DC's (3.19.0)
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: notice: Forcing immediate exit with status 100 (Fatal error occurred, will not respawn)
Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: warning: Inhibiting respawn

So I have two questions:

1. Is there a way to test (using pcs or another tool) to see if the local machine is compatible with the peer?
2. If the node being rebuilt isn't compatible, is there a way to tell it to start in a compatibility mode, or to tell the surviving peer to switch to a compatibility mode? Which one would depend on which node is newer.

Of course, in this particular test case, the node being rebuilt is behind the survivor, so the fix here is a simple update of pacemaker before rejoining. However, in the real world, it's far more likely that the node being joined will be a newer version. The reason for this is that a large number of our deployments are in locations with no or limited internet, so keeping the active cluster regularly updated is not feasible (and some clients "lock" their deployments to approved/tested versions).

Thanks for any hints/tips!

Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404
___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
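Since 'pcs cluster start' returns 0 even when the controller later refuses to join, one stopgap is to scan the journal for the error line quoted above. The following is only a sketch, not a pre-flight compatibility test: it assumes the message wording stays exactly as shown in the log excerpt, and the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, ...]

# Assumption: the wording matches the pacemaker-controld error quoted in the post.
FEATURE_SET_ERROR = re.compile(
    r"Local feature set \((?P<local>[\d.]+)\) "
    r"is incompatible with DC's \((?P<dc>[\d.]+)\)"
)


def parse_version(text: str) -> Version:
    """Turn '3.17.4' into (3, 17, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_feature_set_mismatch(log_text: str) -> Optional[Tuple[Version, Version]]:
    """Return (local, dc) feature sets if the error appears in log_text, else None."""
    match = FEATURE_SET_ERROR.search(log_text)
    if match is None:
        return None
    return parse_version(match["local"]), parse_version(match["dc"])


if __name__ == "__main__":
    sample = (
        "Jun 13 22:35:04 an-a01n02.alteeve.com pacemaker-controld[105161]: error: "
        "Local feature set (3.17.4) is incompatible with DC's (3.19.0)"
    )
    mismatch = find_feature_set_mismatch(sample)
    if mismatch is not None:
        local, dc = mismatch
        side = "older" if local < dc else "newer"
        print(f"local feature set {local} is {side} than the DC's {dc}")
```

A rebuild tool could feed recent journal output to find_feature_set_mismatch() after starting the cluster and decide from the result which side needs updating.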
[ClusterLabs] When did the CIB change how it reports in_ccm and crmd?
I'm sure it was announced and I missed it, but I just tripped over my pants when an update changed 'in_ccm' and 'crmd' in the CIB from 'true/false' to timestamps...

When did that happen? Is there an announcement marking other changes that happened at the same time?

Cheers,

Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404
Re: [ClusterLabs] Planning for Pacemaker 3
On 2024-01-03 12:06, Ken Gaillot wrote:
> Hi all,
>
> I'd like to release Pacemaker 3.0.0 around the middle of this year. I'm gathering proposed changes here:
>
> https://projects.clusterlabs.org/w/projects/pacemaker/pacemaker_3.0_changes/
>
> Please review for anything that might affect you, and reply here if you have any concerns.
>
> Pacemaker major-version releases drop support for deprecated features, to make the code easier to maintain. The biggest planned changes are dropping support for Upstart and Nagios resources, as well as rolling upgrades from Pacemaker 1. Much of the lowest-level public C API will be dropped.
>
> Because the changes will be backward-incompatible, we will continue to make 2.1 releases for a few years, with backports of compatible fixes, to help distribution packagers who need to keep backward compatibility.

If this is already a feature, this is going to sound silly... Would it be possible to trigger scripts if a resource or stonith device entered a FAILED state?

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404
Re: [ClusterLabs] The Linux-HA site is down.
On 2023-05-03 05:26, 黃暄皓 wrote:
> As the title said, is it still in maintenance?

I'm not sure who even owns or maintains that old domain. I don't think it's been used or maintained for a long time.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 04:50, Keisuke MORI wrote:
> Hi,
>
> Just a guess but could it be the same issue with this?
> https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background

That was exactly what it was! Bandini linked the same thing last night. I fixed it by calling 'setsid --wait virsh '.

Thanks!

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't) - SOLVED!
On 2023-01-11 23:10, Madison Kelly wrote:
> Still absolutely stumped.

bandini found the problem:

https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background

/usr/bin/setsid --wait /usr/bin/virsh list --all

That fixed it. omg. I'm going to sleep. holy crap.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
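For agents not written in shell, the same idea as the 'setsid --wait' fix can be sketched with Python's standard library. This is an illustration only, not the code used in the actual RA: run_in_new_session is a made-up helper, and it relies on subprocess's start_new_session option, which makes the child call setsid() before exec so it runs in a session with no controlling terminal.

```python
import subprocess
import sys
from typing import List


def run_in_new_session(command: List[str], timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run command in its own session and wait for it, similar to `setsid --wait`."""
    return subprocess.run(
        command,
        stdin=subprocess.DEVNULL,   # nothing to read from the caller's terminal
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into stdout, like `2>&1`
        text=True,
        timeout=timeout,            # raise TimeoutExpired instead of hanging forever
        start_new_session=True,     # child calls setsid() before exec (POSIX only)
    )


if __name__ == "__main__":
    # Harmless probe instead of virsh: the child reports whether it leads its own session.
    probe = [sys.executable, "-c", "import os; print(os.getsid(0) == os.getpid())"]
    result = run_in_new_session(probe)
    print(result.returncode, result.stdout.strip())
```

In a real agent the probe would be replaced by the virsh command line from the post, for example ["/usr/bin/virsh", "list", "--all"]; the timeout gives a clear failure if the call still stalls.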
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 01:32, Vladislav Bogdanov wrote:
> What would be the reason of running that command without redirecting its output somewhere?

In the real RA I am. I made a super stripped down test script to figure out how to make any call to virsh that didn't end up with it hanging.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 01:28, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 10:27 PM Reid Wahl wrote:
>> On Wed, Jan 11, 2023 at 10:26 PM Reid Wahl wrote:
>>> On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly wrote:
>>>> How do you enable it?
>>>
>>> With `stty tostop`
>>
>> If anything it should be disabled though
>
> I'd be very interested in whether anyone else can reproduce this with your test script

So would I!!

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 01:26, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly wrote:
>> How do you enable it?
>
> With `stty tostop`
>
> It's 100% possible that this whole thing is a red herring by the way. I'm looking for anything that might explain the discrepancy. SIGTTOU may not be directly tied to the root cause.

Appreciate the stab, didn't stop the hang though :(

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 01:12, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 10:12 PM Reid Wahl wrote:
>> The strace outputs show that your bad runs are all getting stopped with SIGTTOU. If you've never heard of that, me either.
>
> I meant to include this: https://stackoverflow.com/questions/10588334/unix-background-process-stopped-abnormally

If I understand the post;

[root@mk-a07n02 ~]# /usr/bin/nohup perl /usr/lib/ocf/resource.d/alteeve/server
/usr/bin/nohup: ignoring input and appending output to 'nohup.out'
[root@mk-a07n02 ~]#

I see the output of the virsh call in the logs fine, no hang.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
On 2023-01-12 01:12, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly wrote:
>> Hi all,
>>
>> There were a lot of sub-threads, so I figured it's helpful to start a new thread with a summary so far. For context; I have a super simple perl script that pretends to be an RA for the sake of debugging.
>>
>> https://pastebin.com/9z314TaB
>>
>> I've had variations log environment variables and confirmed that all the variables in the direct call that works are also in the crm_resource triggered call. There are no selinux issues logged in audit.log and selinux is permissive. The script logs the real and effective UID and GID and they're the same in both instances. Calling other shell programs (tested with 'hostname') runs fine, this is specifically crm_resource -> test RA -> virsh call.
>>
>> I ran strace on the virsh call from inside my test script (changing 'virsh.good' to 'virsh.bad' between running directly and via crm_resource). The strace runs made six files each time. Below are pastebin links with the outputs of the six runs in one paste, but each file's output is in its own block (search for file: to see the different file outputs).
>>
>> Good/direct run of the test RA:
>> - https://pastebin.com/xtqe9NSG
>>
>> Bad/crm_resource triggered run of the test RA:
>> - https://pastebin.com/vBiLVejW
>>
>> Still absolutely stumped.
>
> The strace outputs show that your bad runs are all getting stopped with SIGTTOU. If you've never heard of that, me either.

The hell?! This is new to me also.

> https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html
>
> Macro: int SIGTTOU
> This is similar to SIGTTIN, but is generated when a process in a background job attempts to write to the terminal or set its modes. Again, the default action is to stop the process. SIGTTOU is only generated for an attempt to write to the terminal if the TOSTOP output mode is set; see Output Modes.
>
> Maybe this has something to do with the buffer settings in the perl script(?). It might be worth trying a version that doesn't fiddle with the outputs and buffer settings.

I tried removing the $|, and then I changed the script to be entirely a bash script, still hanging. I tried 'virsh --connect list --all' where method was qemu:///system, qemu:///session, and ssh+qemu:///root@localhost/system, all hang. In bash or perl.

> I don't know which difference between your environment and mine is relevant here, such that I can't reproduce the issue using your test script. It works perfectly fine for me.
>
> Can you run `stty -a | grep tostop`? If there's a minus sign ("-tostop"), it's disabled; if it's present without a minus sign ("tostop"), it's enabled, as best I can tell.

-tostop is there

[root@mk-a07n02 ~]# stty -a | grep tostop
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
[root@mk-a07n02 ~]#

> I'm just spitballing here. It's disabled by default on my machine... but even when I enable it, crm_resource --validate works fine. It may be set differently when running under crm_resource.

How do you enable it?

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
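Running `stty -a` in an interactive shell only describes that shell's terminal, which may differ from what the agent sees under crm_resource. As a rough sketch (the helper names are invented here, and nothing in the thread confirms TOSTOP was the culprit), the same check could be logged from inside a test script using Python's termios module:

```python
import os
import sys
import termios


def tostop_enabled(lflag: int) -> bool:
    """True if the TOSTOP bit is set in a terminal's local-modes flags."""
    return bool(lflag & termios.TOSTOP)


def describe_terminal(fd: int) -> str:
    """Report 'tostop', '-tostop' (stty notation), or 'not a terminal' for fd."""
    if not os.isatty(fd):
        return "not a terminal"
    lflag = termios.tcgetattr(fd)[3]  # index 3 of the returned list is lflag
    return "tostop" if tostop_enabled(lflag) else "-tostop"


if __name__ == "__main__":
    # Check whatever stdout is attached to in this invocation.
    print(describe_terminal(sys.stdout.fileno()))
```

Writing this result to the agent's log in both the direct and the crm_resource runs would show whether the two invocations even share a terminal, rather than inferring it from the login shell.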
[ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)
Hi all,

There were a lot of sub-threads, so I figured it's helpful to start a new thread with a summary so far. For context; I have a super simple perl script that pretends to be an RA for the sake of debugging.

https://pastebin.com/9z314TaB

I've had variations log environment variables and confirmed that all the variables in the direct call that works are also in the crm_resource triggered call. There are no selinux issues logged in audit.log and selinux is permissive. The script logs the real and effective UID and GID and they're the same in both instances. Calling other shell programs (tested with 'hostname') runs fine, this is specifically crm_resource -> test RA -> virsh call.

I ran strace on the virsh call from inside my test script (changing 'virsh.good' to 'virsh.bad' between running directly and via crm_resource). The strace runs made six files each time. Below are pastebin links with the outputs of the six runs in one paste, but each file's output is in its own block (search for file: to see the different file outputs).

Good/direct run of the test RA:
- https://pastebin.com/xtqe9NSG

Bad/crm_resource triggered run of the test RA:
- https://pastebin.com/vBiLVejW

Still absolutely stumped.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 14:49, Ken Gaillot wrote:
> On Wed, 2023-01-11 at 14:09 -0500, Madison Kelly wrote:
>> On 2023-01-11 14:01, Madison Kelly wrote:
>>> On 2023-01-11 01:59, Reid Wahl wrote:
>>>> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov wrote:
>>>>> I suspect that validate action is run as a non-root user.
>>>>
>>>> As far as I know, both the direct command and crm_resource **should** be running the agent as the same user, as long as Madison is running both commands as the same user.
>>>>
>>>> For what it's worth, I copied your test script to my machine (Fedora 36 using the current upstream main of Pacemaker) and it worked fine both directly and via crm_resource. At the moment I'm not able to dig very deeply, but I do wonder if it's either a bug that's since been fixed, or perhaps an environment issue. To try to rule out the former, do you have a test environment where you can try to reproduce it on the latest Pacemaker from upstream?
>>>
>>> I am running both as the same (root, direct ssh, not sudo'd) user. I run them back-to-back with consistent results.
>>>
>>> I've not built pacemaker in ages. Is there a src.rpm that's likely to build against centos stream 8 I could try? If not, do you know the command offhand to create the RPMs from source? If not, I'll grab the source and read the docs for configure.
>>
>> Never mind, I've got it building. Will test shortly.
>
> FYI, you can run "make -C rpm rpm" from a source checkout.

[root@mk-a07n02 RPMS]# pacemakerd --version
Pacemaker 2.1.5-1.39e62b78e.git.el8

Built from main just now, same issue. :/

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 16:23, Vladislav Bogdanov wrote:
> And, one more thing can affect that - selinux. I doubt, but that's worth checking.

selinux is permissive, and nothing written to audit.log.

Side note; I checked GID and effective GID as well as UID and EUID, all 0. I recorded environment variables, and removed the matching ones from below. Here are the differences;

-=] Direct
Environment: [_] -> [/usr/lib/ocf/resource.d/alteeve/server]

-=] crm_resource
Environment: [HA_debug] -> [0]
Environment: [HA_logfacility] -> [none]
Environment: [OCF_EXIT_REASON_PREFIX] -> [ocf-exit-reason:]
Environment: [OCF_OUTPUT_FORMAT] -> [xml]
Environment: [OCF_RA_VERSION_MAJOR] -> [1]
Environment: [OCF_RA_VERSION_MINOR] -> [1]
Environment: [OCF_RESKEY_CRM_meta_timeout] -> [2]
Environment: [OCF_RESKEY_crm_feature_set] -> [3.16.2]
Environment: [OCF_RESKEY_name] -> [srv04-test]
Environment: [OCF_RESOURCE_INSTANCE] -> [test]
Environment: [OCF_RESOURCE_PROVIDER] -> [alteeve]
Environment: [OCF_RESOURCE_TYPE] -> [server]
Environment: [OCF_ROOT] -> [/usr/lib/ocf]
Environment: [OCF_TRACE_FILE] -> [/dev/stderr]
Environment: [PCMK_logfacility] -> [none]
Environment: [PCMK_service] -> [crm_resource]
Environment: [_] -> [/usr/sbin/crm_resource]

> On 11 January 2023 at 22:21:03, Vladislav Bogdanov wrote:
>> Then I would suggest to log all env vars and compare them, probably something is missing in validate for virsh to be happy.
>>
>> On 11 January 2023 at 22:06:45, Madison Kelly wrote:
>>> On 2023-01-11 01:13, Vladislav Bogdanov wrote:
>>>> I suspect that validate action is run as a non-root user.
>>>
>>> I modified the script to log the real and effective UIDs and it's running as root in both instances.
>>>
>>>> On 11 January 2023 at 07:06:55, Madison Kelly wrote:
>>>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>>>
>>>>>>> I've got a hell of a weird problem, and I am absolutely stumped on what's going on.
>>>>>>>
>>>>>>> The short of it is; if my RA is called from the command line, it's fine. If a resource exists, monitor, enable, disable, all that stuff works just fine. If I try to create a resource, it hangs on the validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>>>
>>>>>>> crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=
>>>>>>>
>>>>>>> Specifically, it hangs when it tries to make a shell call (to virsh, specifically, but that doesn't matter). So to debug, I started stripping down my RA simpler and simpler until I was left with the very most basic of programs;
>>>>>>>
>>>>>>> https://pastebin.com/VtSpkwMr
>>>>>>>
>>>>>>> That is literally the simplest program I could write that made the shell call. The 'open()' call is where it hangs.
>>>>>>>
>>>>>>> When I call directly;
>>>>>>>
>>>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server srv04-test; echo rc:$?
>>>>>>>
>>>>>>> real 0m0.061s
>>>>>>> user 0m0.037s
>>>>>>> sys 0m0.014s
>>>>>>> rc:0
>>>>>>>
>>>>>>> It's just fine. I can see in the log the output from the 'virsh' call as well. However, when I call from crm_resource;
>>>>>>>
>>>>>>> time crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>>>
>>>>>>> crm_resource: Error performing operation: Error occurred
>>>>>>>
>>>>>>> real 0m20.521s
>>>>>>> user 0m0.022s
>>>>>>> sys 0m0.010s
>>>>>>> rc:1
>>>>>>>
>>>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>>>
>>>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; /usr/bin/echo return_code:0 |]
>>>>>>>
>>>>>>> Then nothing else. The strace output is:
>>>>>>>
>>>>>>> https://pastebin.com/raw/UCEUdBeP
>>>>>>>
>>>>>>> Environment;
>>>>>>> * selinux is permissive
>>>>>>> * Pacemaker 2.1.5-4.el8
>>>>>>> * pcs 0.10.15
>>>>>>> * 4.18.0-408.el8.x86_64
>>>>>>> * CentOS Stream release 8
>>>>>>>
>>>>>>> Any help is appreciated, I am stumped. :/
>>>>>>
>>>>>> After sending this, I tried having my "RA" call 'hostname', and that worked fine. I switched back to 'virsh list --all', and that hangs. So it seems to somehow be related to calling 'virsh' specifically.
>>>>>
>>>>> OK, so more info... Knowing now that it's a problem with the virsh call specifically (but only when validating; existing VMs monitor, enable, disable fine, all of which repeatedly call virsh), I noticed a few things.
>>>>>
>>>>> First, I see in the logs:
>>>>>
>>>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data: Connection reset by peer
>>>>>
>>>>> So with this, I further simplified my test script to this:
>>>>>
>>>>> https://pastebin.com/Ey8FdL1t
>>>>>
>>>>> Then when I ran my test script directly, the strace output is:
>>>>>
>>>>> Good: https://pastebin.com/Trbq67ub
>>>>>
>>>>> When my script is called via crm_resource, the strace is this:
>>>>>
>>>>> Bad: https://pastebin.com/jtbzHrUM
>>>>>
>>>>> The first difference I can see happens around line 929 in the good paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0" exists, which doesn't in the bad paste. Shortly after, I start seeing:
>>>>>
>>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)
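The environment comparison described in this thread (log every variable in both runs, then drop the ones that match) can be sketched as a small helper. This is not the code from the test RA; diff_environments is an illustrative name, and the LANG entry in the sample is a placeholder for any variable the two runs share.

```python
from typing import Dict, Mapping


def diff_environments(direct: Mapping[str, str],
                      via_crm: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Split two environment snapshots into one-sided and changed variables."""
    only_direct = {k: v for k, v in direct.items() if k not in via_crm}
    only_crm = {k: v for k, v in via_crm.items() if k not in direct}
    changed = {
        k: f"{direct[k]} -> {via_crm[k]}"
        for k in direct
        if k in via_crm and direct[k] != via_crm[k]
    }
    return {"only_direct": only_direct, "only_crm_resource": only_crm, "changed": changed}


if __name__ == "__main__":
    # Values taken from the differences listed above; LANG is a placeholder.
    direct = {"_": "/usr/lib/ocf/resource.d/alteeve/server", "LANG": "C"}
    via_crm = {
        "_": "/usr/sbin/crm_resource",
        "LANG": "C",
        "OCF_ROOT": "/usr/lib/ocf",
        "PCMK_service": "crm_resource",
    }
    for section, values in diff_environments(direct, via_crm).items():
        print(section, values)
```

In practice each snapshot would come from dict(os.environ) written out by the agent in the respective run. As the thread goes on to show, the environment turned out not to be the cause here, but the diff rules it out quickly.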
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 15:55, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly wrote:
>> I built the pacemaker source RPM from Fedora 37, then realized I'm already running 2.1.5 on CS8, so I'm already on the latest release. Looking at git, 2.1.5 is the latest tagged release... Are you running newer than that?
>
> I'm running on the current main, which contains commits that came after the 2.1.5 release. I don't really expect this to be a Pacemaker bug, especially with how recent your version is, but I would like to rule that out if possible.

Would you have either the src.rpm or the ./configure options you used offhand?

--
Madison Kelly
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 01:13, Vladislav Bogdanov wrote:
> I suspect that the validate action is run as a non-root user.

I modified the script to log the real and effective UIDs, and it's running as root in both instances.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
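The real/effective UID check mentioned in this message can be sketched roughly as follows. This is a Python illustration only; the agent discussed in the thread is a separate script whose source is only linked, and the function name `describe_identity` is made up for this example.

```python
import os
import pwd


def describe_identity() -> str:
    """Return a one-line summary of the real and effective user/group IDs."""
    ruid, euid = os.getuid(), os.geteuid()
    rgid, egid = os.getgid(), os.getegid()

    def name(uid: int) -> str:
        # Map a numeric UID to a login name; fall back if no passwd entry exists.
        try:
            return pwd.getpwuid(uid).pw_name
        except KeyError:
            return "unknown"

    return (f"ruid={ruid}({name(ruid)}) euid={euid}({name(euid)}) "
            f"rgid={rgid} egid={egid}")


if __name__ == "__main__":
    # In an agent this line would go to the agent's log file instead of stdout.
    print(describe_identity())
```

Logging this line once from the direct invocation and once from the crm_resource invocation shows whether the two runs really execute under the same identity.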
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 01:59, Reid Wahl wrote:
> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov wrote:
>> I suspect that the validate action is run as a non-root user.
>
> As far as I know, both the direct command and crm_resource **should** be running the agent as the same user, as long as Madison is running both commands as the same user.
>
> For what it's worth, I copied your test script to my machine (Fedora 36 using the current upstream main of Pacemaker) and it worked fine both directly and via crm_resource. At the moment I'm not able to dig very deeply, but I do wonder if it's either a bug that's since been fixed, or perhaps an environment issue.
>
> To try to rule out the former, do you have a test environment where you can try to reproduce it on the latest Pacemaker from upstream?

I built the pacemaker source RPM from Fedora 37, then realized I'm already running 2.1.5 on CS8, so I'm already on the latest release. Looking at git, 2.1.5 is the latest tagged release... Are you running newer than that?

--
Madison Kelly
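One way to test the "environment issue" theory raised above is to have the script record its environment on each run and compare the two records afterwards. The following is a minimal Python sketch under that assumption; the function names and the JSON snapshot format are invented for illustration and are not part of Pacemaker or of the agent in this thread.

```python
import json
import os


def snapshot_environment(path: str) -> dict:
    """Write the current environment, working directory and umask to `path`."""
    umask = os.umask(0)
    os.umask(umask)  # os.umask() sets a new mask, so put the old one back
    snapshot = {
        "env": dict(os.environ),
        "cwd": os.getcwd(),
        "umask": oct(umask),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2, sort_keys=True)
    return snapshot


def diff_environments(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for every variable that differs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}
```

Calling `snapshot_environment()` with a different path in each invocation and then passing the two `"env"` dictionaries to `diff_environments()` lists the variables that are set, unset, or different between the direct run and the crm_resource run.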
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 14:01, Madison Kelly wrote:
> I've not built pacemaker in ages. Is there a src.rpm that's likely to build against CentOS Stream 8 that I could try? If not, do you know offhand the command to create the RPMs from source? If not, I'll grab the source and read the docs for configure.

Never mind, I've got it building. Will test shortly.

--
Madison Kelly
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 01:59, Reid Wahl wrote:
> As far as I know, both the direct command and crm_resource **should** be running the agent as the same user, as long as Madison is running both commands as the same user.
>
> To try to rule out the former, do you have a test environment where you can try to reproduce it on the latest Pacemaker from upstream?

I am running both as the same user (root, direct ssh, not sudo'd). I run them back-to-back with consistent results.

I've not built pacemaker in ages. Is there a src.rpm that's likely to build against CentOS Stream 8 that I could try? If not, do you know offhand the command to create the RPMs from source? If not, I'll grab the source and read the docs for configure.

--
Madison Kelly
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 07:46, Bob Peterson wrote:
> On 1/11/23 1:06 AM, Madison Kelly wrote:
>> So to debug, I started stripping down my RA simpler and simpler until I was left with the very most basic of programs;
>>
>> https://pastebin.com/VtSpkwMr
>
> In the failing case do you get any interesting messages on the console or in dmesg?
>
> Bob Peterson

Nope, nothing in dmesg. At the console, I see:

[root@mk-a07n02 ~]# crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test
provider="alteeve">
execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
crm_resource: Error performing operation: Error occurred

--
Madison Kelly
Re: [ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)
On January 11, 2023 2:26:57 a.m. EST, Ulrich Windl wrote:
>>>> Madison Kelly wrote on 11.01.2023 at 06:21 in message <74df2c8e-1cff-ba07-7f4a-070be296b...@alteeve.com>:
>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>
>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; /usr/bin/echo return_code:0 |]
>
> In the VirtualDomain RA I found a similar command (assuming that works):
>   virsh $VIRSH_OPTIONS dumpxml --inactive --security-info ${DOMAIN_NAME} > ${CFGTMP}
>
> virsh is somewhat strange; libvirtd is running, right?

Yes. I can call the RA directly, then immediately call crm_resource, or do it in the reverse order, and I always get the same results. Again, the same calls work fine when enabling, disabling, etc. So weird...
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 00:21, Madison Kelly wrote:
> After sending this, I tried having my "RA" call 'hostname', and that worked fine. I switched back to 'virsh list --all', and that hangs. So it seems to somehow be related to calling 'virsh' specifically.

OK, so more info...

Knowing now that it's a problem with the virsh call specifically (but only when validating; existing VMs monitor, enable, and disable fine, all of which repeatedly call virsh), I noticed a few things.

First, I see in the logs:

Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data: Connection reset by peer

So with this, I further simplified my test script to this:

https://pastebin.com/Ey8FdL1t

When I ran my test script directly, the strace output is:

Good: https://pastebin.com/Trbq67ub

When my script is called via crm_resource, the strace is this:

Bad: https://pastebin.com/jtbzHrUM

The first difference I can see happens around line 929 in the good paste: the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0" exists there, but not in the bad paste. Shortly after, around line 959 in the bad paste, I start seeing:

line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
line: [brk(NULL) = 0x562b7877d000]
line: [brk(0x562b787aa000) = 0x562b787aa000]
line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]

There are more brk() lines, and not long after the output stops.

--
Madison Kelly
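Finding the first point where a "good" and a "bad" strace log part ways is easier when addresses are masked first, since heap and futex addresses change from run to run. The sketch below is an illustrative Python helper (the names `normalize` and `first_divergence` are made up here); it walks the two logs in lockstep and is not a full diff.

```python
import re
from itertools import zip_longest

# Hex values such as 0x7f48b0001ca0 differ between runs, so mask them.
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize(line: str) -> str:
    """Strip surrounding whitespace and replace hex addresses with a placeholder."""
    return _HEX.sub("0xADDR", line.strip())


def first_divergence(good_lines, bad_lines):
    """Return (line_number, good_line, bad_line) for the first mismatch, or None."""
    pairs = zip_longest(good_lines, bad_lines)
    for number, (good, bad) in enumerate(pairs, start=1):
        # A missing line on either side (one log is shorter) also counts.
        if good is None or bad is None or normalize(good) != normalize(bad):
            return number, good, bad
    return None
```

With the two logs read via `open(path).read().splitlines()`, `first_divergence()` reports the first line where the syscall sequence differs. Because the comparison is positional, everything after one extra or missing line will also differ; `difflib.unified_diff` from the standard library is the better tool for reviewing the full set of differences.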
Re: [ClusterLabs] RA hangs when called by crm_resource (resending text format)
On 2023-01-11 00:14, Madison Kelly wrote:
> Specifically, it hangs when it tries to make a shell call (to virsh, specifically, but that doesn't matter).

After sending this, I tried having my "RA" call 'hostname', and that worked fine. I switched back to 'virsh list --all', and that hangs. So it seems to somehow be related to calling 'virsh' specifically.

--
Madison Kelly
[ClusterLabs] RA hangs when called by crm_resource (resending text format)
Hi all,

Edit: Last message was in HTML format, sorry about that.

I've got a hell of a weird problem, and I am absolutely stumped on what's going on.

The short of it is; if my RA is called from the command line, it's fine. If a resource exists, monitor, enable, disable, all that stuff works just fine. If I try to create a resource, it hangs on the validate stage. Specifically, it hangs when 'pcs' calls:

  crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=

Specifically, it hangs when it tries to make a shell call (to virsh, specifically, but that doesn't matter). So to debug, I started stripping down my RA simpler and simpler until I was left with the very most basic of programs;

  https://pastebin.com/VtSpkwMr

That is literally the simplest program I could write that made the shell call. The 'open()' call is where it hangs.

When I call directly;

  time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server srv04-test; echo rc:$?

  real    0m0.061s
  user    0m0.037s
  sys     0m0.014s
  rc:0

It's just fine. I can see in the log the output from the 'virsh' call as well. However, when I call from crm_resource;

  time crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test; echo rc:$?
  provider="alteeve">
  execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
  crm_resource: Error performing operation: Error occurred

  real    0m20.521s
  user    0m0.022s
  sys     0m0.010s
  rc:1

In the log file, I see (from line 20 of the super-simple-test-script):

  Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; /usr/bin/echo return_code:0 |]

Then nothing else. The strace output is:

  https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
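One way to narrow down a hang like the one described above is to run the same shell call under an explicit timeout, so that a stuck child process surfaces as a timeout with whatever output it produced instead of blocking indefinitely. The following is a minimal Python sketch of that idea, not the RA from the thread. `timed_run` is a hypothetical helper, and detaching stdin is only an assumption about what might differ between an interactive shell and crm_resource; the thread does not establish the cause.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s):
    """Run cmd with stdin detached and a hard timeout.

    Returns (return_code, output, elapsed_seconds, timed_out).
    return_code is None when the command timed out.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # assumption: rule out a child waiting on stdin
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # same effect as the 2>&1 in the logged call
            timeout=timeout_s,
        )
        output = proc.stdout.decode(errors="replace")
        return proc.returncode, output, time.monotonic() - start, False
    except subprocess.TimeoutExpired as exc:
        # The child has been killed; keep whatever output was captured so far.
        output = (exc.output or b"").decode(errors="replace")
        return None, output, time.monotonic() - start, True


if __name__ == "__main__":
    # Harmless stand-in command. When debugging, swap in the real call, e.g.
    # ["/usr/bin/virsh", "dumpxml", "--inactive", "srv04-test"]
    rc, out, elapsed, timed_out = timed_run([sys.executable, "-c", "print('ok')"], 5)
    print(f"rc:{rc} timed_out:{timed_out} elapsed:{elapsed:.3f}s")
    print(out, end="")
```

Running the same command once from a shell and once from inside the agent while it is being driven by crm_resource, then comparing the `timed_out` flag and the captured output, may help show whether the child ever starts producing output in the failing case.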
Re: [ClusterLabs] Preventing a resource from migrating to / starting on a node
On 2022-11-29 00:31, Reid Wahl wrote:
> On Mon, Nov 28, 2022 at 8:21 PM Madison Kelly wrote:
>> This question builds on questions I was discussing with kgaillot on IRC.
>>
>> I am trying to prevent a resource from being allowed to migrate to or start on a given node. When I asked about this, Ken talked about node attributes, which I've been trying to implement.
>>
>> To figure this out and test it, I set up a node attribute called 'drbd-fenced_srv01-sql' for a resource called 'srv01-sql', with a location constraint rule that scores -INFINITY. I had the resource running on 'mk-a01n01' and then set 'drbd-fenced_srv01-sql=1' to trigger the constraint against 'mk-a01n02'. I verified this was set, then tried migrating it, and it happily migrated. Clearly I am missing something. :)
>>
>> [root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
>> scope=nodes  name=drbd-fenced_srv01-sql value=1
>>
>> [root@mk-a01n01 ~]# pcs constraint location config
>> Location Constraints:
>>   Resource: srv01-sql
>>     Enabled on:
>>       Node: mk-a01n02 (score:100)
>>       Node: mk-a01n01 (score:200)
>>     Constraint: location-srv01-sql
>>       Rule: score=-INFINITY
>>         Expression: drbd-fenced_srv01-sql eq 0
>>   Resource: srv02-web
>>     Enabled on:
>>       Node: mk-a01n02 (score:100)
>>       Node: mk-a01n01 (score:200)
>>
>> [root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
>> scope=nodes  name=drbd-fenced_srv01-sql value=1
>>
>> [root@mk-a01n01 ~]# pcs resource status srv01-sql
>>   * srv01-sql (ocf::alteeve:server): Started mk-a01n01
>>
>> [root@mk-a01n01 ~]# pcs constraint location srv01-sql prefers mk-a01n02=200 mk-a01n01=100
>>
>> [root@mk-a01n01 ~]# pcs resource status srv01-sql
>>   * srv01-sql (ocf::alteeve:server): Migrating mk-a01n01
>>
>> [root@mk-a01n01 ~]# pcs resource status srv01-sql
>>   * srv01-sql (ocf::alteeve:server): Started mk-a01n02
>>
>> I feel like this shouldn't be so complicated, so I am likely over-thinking this, or missing something obvious...
>>
>> --
>> Madison Kelly
>> Alteeve's Niche!
>> Chief Technical Officer
>> c: +1-647-471-0951
>> https://alteeve.com/
>
> The configured rule prevents srv01-sql from running on a node where the drbd-fenced_srv01-sql attribute is set to 0. It looks like it's set to 1.
>
> Maybe I'm misunderstanding though -- if I am, can you help clarify and send the CIB so that I can mess around with it?

Excuse me one second...

"AARG!!"

OK, now I am better. Thank you, that was the problem. :)

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
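The mismatch Reid points out can be illustrated with a small model of how a rule-based location constraint contributes to a node's score. This is a simplified Python sketch under stated assumptions, not Pacemaker's actual scheduler code: `expression_matches` and `effective_score` are hypothetical names, only the `eq` operation is modelled, and an undefined attribute is assumed never to match.

```python
NEG_INFINITY = float("-inf")


def expression_matches(node_attrs, attribute, operation, value):
    """Return True if the node's attribute satisfies the rule expression."""
    actual = node_attrs.get(attribute)
    if actual is None:
        # Modelling assumption: an unset attribute never matches 'eq'.
        return False
    if operation == "eq":
        return actual == value
    raise ValueError(f"operation not modelled in this sketch: {operation}")


def effective_score(node_attrs, preference, rule):
    """Combine a node's plain preference score with one rule's score.

    rule is (attribute, operation, value, rule_score). The rule's score is
    only added when its expression matches on this node.
    """
    attribute, operation, value, rule_score = rule
    if expression_matches(node_attrs, attribute, operation, value):
        return preference + rule_score
    return preference


if __name__ == "__main__":
    # Values taken from the thread: the attribute is 1 on mk-a01n02.
    n02 = {"drbd-fenced_srv01-sql": "1"}
    as_configured = ("drbd-fenced_srv01-sql", "eq", "0", NEG_INFINITY)
    as_intended = ("drbd-fenced_srv01-sql", "eq", "1", NEG_INFINITY)
    print(effective_score(n02, 200, as_configured))
    print(effective_score(n02, 200, as_intended))
```

In this model the rule as configured (`eq 0`) leaves mk-a01n02's preference untouched because the attribute is 1 there, which is consistent with the migration succeeding. A rule matching the value that was actually set would push the score to -INFINITY instead.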
Re: [ClusterLabs] Antw: [EXT] Preventing a resource from migrating to / starting on a node
I was taking Ken's advice. Originally my plan was to use location constraints, but I assume Ken's reasoning was sound for the node attribute approach.

On 2022-11-29 02:51, Ulrich Windl wrote:
> Why can't you use a plain location constraint?
>
>>>> Madison Kelly wrote on 29.11.2022 at 05:21 in message

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
[ClusterLabs] Preventing a resource from migrating to / starting on a node
This question builds on questions I was discussing with kgaillot on IRC.

I am trying to prevent a resource from being allowed to migrate to or start on a given node. When I asked about this, Ken talked about node attributes, which I've been trying to implement.

To figure this out and test it, I set up a node attribute called 'drbd-fenced_srv01-sql' for a resource called 'srv01-sql', with a location constraint rule that scores -INFINITY. I had the resource running on 'mk-a01n01' and then set 'drbd-fenced_srv01-sql=1' to trigger the constraint against 'mk-a01n02'. I verified this was set, then tried migrating it, and it happily migrated. Clearly I am missing something. :)

[root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
scope=nodes  name=drbd-fenced_srv01-sql value=1

[root@mk-a01n01 ~]# pcs constraint location config
Location Constraints:
  Resource: srv01-sql
    Enabled on:
      Node: mk-a01n02 (score:100)
      Node: mk-a01n01 (score:200)
    Constraint: location-srv01-sql
      Rule: score=-INFINITY
        Expression: drbd-fenced_srv01-sql eq 0
  Resource: srv02-web
    Enabled on:
      Node: mk-a01n02 (score:100)
      Node: mk-a01n01 (score:200)

[root@mk-a01n01 ~]# crm_attribute --type nodes --node mk-a01n02 --name drbd-fenced_srv01-sql --query
scope=nodes  name=drbd-fenced_srv01-sql value=1

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql (ocf::alteeve:server): Started mk-a01n01

[root@mk-a01n01 ~]# pcs constraint location srv01-sql prefers mk-a01n02=200 mk-a01n01=100

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql (ocf::alteeve:server): Migrating mk-a01n01

[root@mk-a01n01 ~]# pcs resource status srv01-sql
  * srv01-sql (ocf::alteeve:server): Started mk-a01n02

I feel like this shouldn't be so complicated, so I am likely over-thinking this, or missing something obvious...

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
[ClusterLabs] HA Summit 2023
Understanding that it's all but impossible to predict how covid will go, I think it's as good a time as any to start planning for the next HA Summit.

Last time was in Brno hosted by Red Hat. So I suppose we can prod SUSE to host this time? SUSE folks, how does that sound?

I'm thinking summer or fall of '23. Basically, consider this a "starting the ball rolling" and that's it. What makes sense to people? How comfortable would people be with restarting the HA Summits in person again? Any preference for location, timing, etc?

Madi

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] DRBD and SQL Server
On 2022-09-28 04:36, Jehan-Guillaume de Rorthais wrote:
> On Wed, 28 Sep 2022 02:33:59 -0400 Madison Kelly wrote:
> ...
>> I'm happy to go into more detail, but I'll stop here until/unless you have more questions. Otherwise I'd write a book. :)
>
> I would buy it ;)

Haha! Feel free to email me directly with specific questions and I'll go into detail. That avoids flooding the channel. :)

Madi

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
Re: [ClusterLabs] DRBD and SQL Server
The way we do it is to create a VM that runs Windows and hosts the DB, so neither the OS nor the DB needs any concept that there's replication behind the scenes.

Exactly how we do it is specific to our platform (The Anvil), but in short we've got a custom RA for pacemaker and management tools to create LVs per VM, and those LVs become backing devices for a DRBD resource with 1 or more volumes. The resource runs in single-primary mode except during a live migration. For that (all of this is handled in our Pacemaker RA), we enable dual-primary, promote the target to Primary, migrate, demote the old host to Secondary and disable dual-primary support again.

Of course, protection is provided via IPMI fencing as the primary method with switched PDU fencing as a backup.

I'm happy to go into more detail, but I'll stop here until/unless you have more questions. Otherwise I'd write a book. :)

Madi

On 2022-09-27 15:42, Eric Robinson wrote:
> Hi Madi,
>
> It sounds like you’ve had a lot of good experience. I’m trying to decide between paying a premium price for MSSQL Enterprise with Always-On Replication or just setting up an Active/Standby scenario with the Standard Edition of MSSQL running on DRBD. We have tons of experience with MySQL on DRBD, but not with MSSQL.
>
> When running MSSQL on DRBD, what’s the cluster stack? How does failover work? When using MySQL, the service only runs on one server at a time. In a failover, the writable data volume transitions to the standby server and then the MySQL service is started on it. Does it work the same way with MSSQL?
>
> -Eric
>
> From: Madison Kelly
> Sent: Monday, September 26, 2022 7:55 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed; Eric Robinson
> Subject: Re: [ClusterLabs] DRBD and SQL Server
>
>> On 2022-09-25 23:49, Eric Robinson wrote:
>>> Hey list,
>>>
>>> Anybody have experience running SQL Server on DRBD? I’d ask this in the DRBD list but that one is like a ghost town. This list is the next best option.
>>>
>>> -Eric
>>
>> Extensively, yes. Albeit in VMs whose storage was backed by DRBD, though for all practical purposes there's no real difference. We've had clients running various DB servers for over ten years spanning DRBD 8.3 through to the latest 9.1.
>>
>> What's your question?
>>
>> Madi

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
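The live-migration sequence described at the top of this message (enable dual-primary, promote the target, migrate, demote the old host, disable dual-primary) can be written down as an ordered plan. The following Python sketch only builds and prints that plan; it does not call DRBD, libvirt or Pacemaker, and it is not the Anvil RA. The function name, the step wording and the example resource and host names are illustrative assumptions.

```python
def live_migration_plan(resource, source, target):
    """Return the ordered steps for one live migration as (host, action) pairs.

    'both' means the step applies to the DRBD resource on both peers.
    """
    return [
        ("both", f"enable dual-primary on DRBD resource {resource}"),
        (target, f"promote {resource} to Primary"),
        (source, f"live migrate the server to {target}"),
        (source, f"demote {resource} to Secondary"),
        ("both", f"disable dual-primary on DRBD resource {resource}"),
    ]


if __name__ == "__main__":
    # Hypothetical names, used only to show the ordering.
    for step, (host, action) in enumerate(
        live_migration_plan("srv01-sql", "mk-a01n01", "mk-a01n02"), start=1
    ):
        print(f"{step}. [{host}] {action}")
```

The ordering carries the main point: dual-primary is only allowed for the short window in which both hosts need the device writable, and the resource is back in single-primary mode once the old host has been demoted.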
Re: [ClusterLabs] DRBD and SQL Server
On 2022-09-25 23:49, Eric Robinson wrote:
> Hey list,
>
> Anybody have experience running SQL Server on DRBD? I’d ask this in the DRBD list but that one is like a ghost town. This list is the next best option.
>
> -Eric

Extensively, yes. Albeit in VMs whose storage was backed by DRBD, though for all practical purposes there's no real difference. We've had clients running various DB servers for over ten years spanning DRBD 8.3 through to the latest 9.1.

What's your question?

Madi

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/