Hi. I think there is a problem with the updated Xen RA, specifically with the if statement below, though I am not sure; I may simply be confused about how bash || works. I never see my servers enter the retry loop when a VM disappears.
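To check my reading of the short-circuit, here is a minimal standalone sketch (the ocf_is_probe stub below is purely for illustration; the real helper comes from ocf-shellfuncs, where a probe is a monitor with interval 0). Run for a regular recurring monitor, it always takes the early-return branch:

#!/bin/sh
# Illustration-only stand-in for ocf_is_probe from ocf-shellfuncs:
# a probe is a monitor operation with interval 0.
ocf_is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] && [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}

# Simulate a regular (non-probe) recurring monitor operation.
__OCF_ACTION="monitor"
OCF_RESKEY_CRM_meta_interval="30000"

if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return taken; the retry loop is never reached"
else
    echo "would fall through to the retry loop"
fi

As far as I can tell, only a stop operation can make both sides of the || false, so only stop ever reaches the retry loop.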
The check in the RA reads:

if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    return $rc
fi

Does this not mean that if we run a monitor operation that is not a probe we will have:

    ocf_is_probe                  -> false
    [ "$__OCF_ACTION" != "stop" ] -> true   (monitor != stop)
    false || true                 -> true

which will cause the if statement to return $rc and never enter the loop?

Xen_Status_with_Retry() {
    local rc cnt=5

    Xen_Status $1
    rc=$?
    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        return $rc
    fi
    while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
        case "$__OCF_ACTION" in
        stop)
            ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
            ;;
        monitor)
            ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
            ;;
        *) : not reachable ;;
        esac
        sleep 1
        Xen_Status $1
        rc=$?
        let cnt=$((cnt-1))
    done
    return $rc
}

On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
> Hi Tom,
>
> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>> Hi Dejan
>>
>> Just a quick question. I cannot see your new log messages being logged
>> to syslog:
>>
>> ocf_log warn "domain $1 reported as not running, but it is expected to
>> be running! Retrying for $cnt seconds ..."
>>
>> Do you know where I can set my logging to see warn-level messages? I
>> expected to see them in my testing by default, but that does not seem to
>> be true.
> You should see them by default. But note that these warnings may
> not happen, depending on the circumstances on your host. In my
> experiments they were logged only while the guest was rebooting,
> and then just once or maybe twice. If you have recent
> resource-agents and crmsh, you can enable operation tracing (with
> crm resource trace <rsc> monitor <interval>).
>
> Thanks,
>
> Dejan
>
>> Thanks
>>
>> Tom
>>
>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>> Hi!
>>>>
>>>> I thought I'd never be bitten by this bug, but I actually was! Now I'm
>>>> wondering whether the Xen RA sees the guest if you use pygrub, and
>>>> pygrub is still counting down for the actual boot...
>>>>
>>>> But the reason why I'm writing is that I think I've discovered another
>>>> bug in the RA:
>>>>
>>>> CRM decided to "recover" the guest VM "v02":
>>>> [...]
>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>> [...]
>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>> [...]
>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>> [...]
>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>> [...]
>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>> [...]
>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>> [...]
>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>> [...]
>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>> [...]
>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>> [...]
>>>>
>>>> As you can clearly see, "start" failed because the guest was found up already!
>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>> domain being gone for a while and then reappearing.
>>>
>>>> I guess the following test is problematic:
>>>> ---
>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>> rc=$?
>>>> if [ $rc -ne 0 ]; then
>>>>     return $OCF_ERR_GENERIC
>>>> ---
>>>> Here "xm create" probably fails if the guest is already created...
>>> It should fail too. Note that this is a race, but the race is
>>> anyway caused by the strange behaviour of xen. With the recent
>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at 12:24 in message <20131001102430.GA4687@walrus.homenet>:
>>>>> Hi,
>>>>>
>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>
>>>>>>> Thanks for paying attention to this issue (not really a bug), as I am
>>>>>>> sure I am not the only one with it. For now I have set all my
>>>>>>> VMs to destroy so that the cluster is the only thing managing them,
>>>>>>> but this is not super clean, as I get failures in my logs that are
>>>>>>> not really failures.
>>>>>> It is very much a severe bug.
>>>>>>
>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>> Take a look here:
>>>>>
>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>
>>>>>> Regards,
>>>>>> Lars
>>>>>>
>>>>>> --
>>>>>> Architect Storage/HA
>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>> "Experience is the name everyone gives to their mistakes."
>>>>>> -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems