On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell <ky...@0b10.mx> wrote:
> Finally got around to working on this.
>
> I spoke with someone on the #clusterlabs IRC channel who mentioned that
> the monitor_scripts param does indeed run at some frequency (op monitor
> timeout=? interval=?), not just during the "start" and "migrate_from"
> actions.
>
> The monitor_scripts param does not support scripts with command-line args,
> just a space-delimited list for running multiple scripts. This means that
> each VirtualDomain resource needs its own script to be able to define the
> ${DOMAIN_NAME}. I found that a bit annoying, so I created a symlink to a
> wrapper script, using the ${DOMAIN_NAME} as the first part of the filename
> and a separator for awk:
>

The scripts called by the monitor operation should inherit the environment
of the monitor, so you should be able to use these variables directly.

Klaus

> ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh
>
> and in my wrapper_script.sh:
>
> #!/bin/bash
> DOMAIN_NAME=$(basename "$0" | awk -F'____' '{print $1}')
> /path/to/myscript.sh -H "${DOMAIN_NAME}" -C guest-get-time -l 25 -w 1
>
> (a bit hack-y, but better than creating one script per VM resource and
> modifying it with the ${DOMAIN_NAME})
>
> Then creating the cluster resource:
>
> pcs resource create myvmhostname VirtualDomain \
>     config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" \
>     migration_transport="ssh" force_stop="false" \
>     monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" \
>     meta allow-migrate="true" target-role="Stopped" \
>     op migrate_from timeout=90s interval=0s \
>     op migrate_to timeout=120s interval=0s \
>     op monitor timeout=40s interval=10s \
>     op start timeout=90s interval=0s \
>     op stop timeout=90s interval=0s
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell <ky...@0b10.mx> wrote:
>
> > Let me know if there is a better approach to the following problem.
When > the virtual machine does not respond to a state query I want the cluster to > kick it > > > > I could not find any useful docs for using the nagios plugins. After > reading the documentation about running a custom script via the "monitor" > function in the RA I determined that would not meet my requirements as it's > only run on start and migrate(unless I read it incorrectly?). > > > > Here is what I did (im on ubuntu 20.04): > > > > cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain > /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain > > > > cp /usr/share/resource-agents/ocft/configs/VirtualDomain cp > /usr/share/resource-agents/ocft/configs/MyVirtDomain > > > > sed -i 's/VirtualDomain/MyVirtDomain/g' > /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain > > > > sed -i 's/VirtualDomain/MyVirtDomain/g' > /usr/share/resource-agents/ocft/configs/MyVirtDomain > > > > edited function MyVirtDomain_status in > /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the > status case running|paused|idle|blocked|"in shutdown") > > > > FROM > > > > running|paused|idle|blocked|"in shutdown") > > > > # running: domain is currently actively consuming cycles > > > > # paused: domain is paused (suspended) > > > > # idle: domain is running but idle > > > > # blocked: synonym for idle used by legacy Xen versions > > > > # in shutdown: the domain is in process of shutting down, but has not > completely shutdown or crashed. > > > > ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status." > > > > rc=$OCF_SUCCESS > > > > TO > > > > running|paused|idle|blocked|"in shutdown") > > > > # running: domain is currently actively consuming cycles > > > > # paused: domain is paused (suspended) > > > > # idle: domain is running but idle > > > > # blocked: synonym for idle used by legacy Xen versions > > > > # in shutdown: the domain is in process of shutting down, but has not > completely shutdown or crashed. 
> >         custom_chk=$(/path/to/myscript.sh -H "$DOMAIN_NAME" -C guest-get-time -l 25 -w 1)
> >         custom_rc=$?
> >         if [ ${custom_rc} -eq 0 ]; then
> >             ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
> >             rc=$OCF_SUCCESS
> >         else
> >             ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
> >             rc=$OCF_ERR_GENERIC
> >         fi
> >
> > The custom script uses the qemu-guest-agent in my guest, passing the
> > parameter to grab the guest's time (this seems to be the most universal
> > command: Windows, CentOS 6, Ubuntu, CentOS 7). It runs 25 loops, sleeping
> > 1 second between iterations. It exits 0 as soon as the agent responds with
> > the time and exits 1 after the 25th loop; per the docs, these correspond
> > to OCF_SUCCESS and OCF_ERR_GENERIC.
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
> >
> > or when it's not responding:
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> >
> > ...
> > (exits after the 25th attempt) or
> >
> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
> >
> > and when the VM isn't running:
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'
> >
> > I updated my test VM to use the new RA and raised the status timeout from
> > the default of 30s to 40s, just in case.
> >
> > I'd like to be able to update the parameters to myscript.sh via crm
> > configure edit at some point, but will figure that out later...
> >
> > My test:
> >
> > Reboot the VM from within the OS and hit Escape to enter the boot-mode
> > prompt. After ~30 seconds the cluster decides the resource is having a
> > problem, marks it as failed, and restarts the virtual machine (on the same
> > node, which in my case is desirable). Once the guest is back up and
> > responding, the cluster reports the VM as Started.
> >
> > I still have plenty more testing to do and will keep the list posted on
> > progress.
> >
> > -Kyle
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell ky...@0b10.mx wrote:
> >
> > > guest-get-fsinfo doesn't seem to work on older agents (CentOS 6); I've
> > > found guest-get-time more universal.
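myscript.sh itself was never posted. Based only on the behaviour described in the June message (send guest-get-time through the guest agent up to 25 times, 1 s apart, exit 0 on the first answer and 1 after the last failure), a minimal POSIX-shell sketch of that loop might look like this. The function name probe_guest_agent and its argument order are hypothetical, and it assumes `virsh` reaches the right hypervisor with its default connection URI.

```shell
#!/bin/bash
# Hypothetical sketch of the retry loop described for myscript.sh
# (not the actual script, which was not posted).
# Usage: probe_guest_agent DOMAIN [LOOPS] [WAIT_SECONDS]
# Returns 0 (OCF_SUCCESS) on the first agent answer, 1 (OCF_ERR_GENERIC)
# after LOOPS failed attempts.
probe_guest_agent() {
    local domain loops wait_s i out
    domain=$1
    loops=${2:-25}
    wait_s=${3:-1}
    i=1
    while [ "$i" -le "$loops" ]; do
        # guest-get-time was found to work even on older agents (CentOS 6)
        if out=$(virsh qemu-agent-command "$domain" '{"execute":"guest-get-time"}' 2>&1); then
            printf '[GOOD] - %s virsh qemu-agent-command guest-get-time output: %s\n' "$domain" "$out"
            return 0
        fi
        printf '[BAD] - %s virsh qemu-agent-command guest-get-time output: %s\n' "$domain" "$out"
        # No sleep after the final attempt
        if [ "$i" -lt "$loops" ]; then
            sleep "$wait_s"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `probe_guest_agent myvm 25 1` mirrors `-H myvm -C guest-get-time -l 25 -w 1`. Its worst case is roughly 25 s plus however long each virsh call takes, which has to fit inside the monitor timeout configured for the resource.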
> > >
> > > Also, I found this helpful thread on using monitor_scripts, which is
> > > part of the VirtualDomain RA:
> > >
> > > https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > >
> > > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell ky...@0b10.mx wrote:
> > >
> > > > I am thinking about using the qemu-guest-agent to run one of the
> > > > available commands to determine the health of the OS inside the VM:
> > > >
> > > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
> > > >
> > > > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
> > > >
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > >
> > > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidj...@gmail.com wrote:
> > > >
> > > > > On 03.05.2021 09:48, Ulrich Windl wrote:
> > > > >
> > > > > > Ken Gaillot kgail...@redhat.com wrote on 30.04.2021 at 16:57 in
> > > > > > message 3acef4bc31923fb019619c713300444c2dcd354a.ca...@redhat.com:
> > > > > >
> > > > > > > On Fri, 2021-04-30 at 11:00 +0100, lejeczek wrote:
> > > > > > > > Hi guys
> > > > > > > >
> > > > > > > > I'd like to ask around for thoughts & suggestions on any
> > > > > > > > semi/official ways to monitor VirtualDomain.
> > > > > > > > Something beyond what the included RA does, such as actual
> > > > > > > > health testing of and communication with the VM's OS.
> > > > > > > >
> > > > > > > > many thanks, L.
> > > > > > >
> > > > > > > This use case led to a Pacemaker feature many moons ago ...
> > > > > > >
> > > > > > > Pacemaker supports nagios plug-ins as a resource type (e.g.
> > > > > > > nagios:check_apache_status).
These are service checks usually > used with > > > > > > > > > > > > > > monitoring software such as nagios, icinga, etc. > > > > > > > > > > > > > > If the service being monitored is inside a VirtualDomain, > named vm1 for > > > > > > > > > > > > > > example, you can configure the nagios resource with the > resource meta‑ > > > > > > > > > > > > > > attribute container="vm1". If the nagios check fails, > Pacemaker will > > > > > > > > > > > > > > restart vm1. > > > > > > > > > > > > "check fails" mans WARNING, CRITICAL, or UNKNOWN? ;-) > > > > > > > > > > switch (rc) { > > > > > > > > > > case NAGIOS_STATE_OK: > > > > > > > > > > return PCMK_OCF_OK; > > > > > > > > > > case NAGIOS_INSUFFICIENT_PRIV: > > > > > > > > > > return PCMK_OCF_INSUFFICIENT_PRIV; > > > > > > > > > > case NAGIOS_NOT_INSTALLED: > > > > > > > > > > return PCMK_OCF_NOT_INSTALLED; > > > > > > > > > > case NAGIOS_STATE_WARNING: > > > > > > > > > > case NAGIOS_STATE_CRITICAL: > > > > > > > > > > case NAGIOS_STATE_UNKNOWN: > > > > > > > > > > case NAGIOS_STATE_DEPENDENT: > > > > > > > > > > default: > > > > > > > > > > return PCMK_OCF_UNKNOWN_ERROR; > > > > > > > > > > } > > > > > > > > > > return PCMK_OCF_UNKNOWN_ERROR; > > > > > > > > > > Manage your subscription: > > > > > > > > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > > > > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > > Manage your subscription: > > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ >