Some more reading of the source code makes me think the '|| [ "$__OCF_ACTION" != "stop" ]' part is not needed.
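If that's right, the guard at the top of Xen_Status_with_Retry() could shrink to something like this (just a sketch of what I have in mind, untested):

    Xen_Status $1
    rc=$?
    # only probes should bail out early; stop and monitor both fall
    # through to the retry loop and its case statement below
    if ocf_is_probe; then
        return $rc
    fi

That way a monitor that is not a probe actually reaches the while loop and logs the warning.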
Xen_Status_with_Retry() is only called from Stop and Monitor so we only
need to check if it's a probe. Everything else should be handled in the
case statement in the loop.

Tom

On 10/16/2013 05:16 PM, Tom Parker wrote:
> Hi. I think there is an issue with the updated Xen RA.
>
> I think there is an issue with the if statement here but I am not sure.
> I may be confused about how bash || works but I don't see my servers
> ever entering the loop on a vm disappearing.
>
>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>         return $rc
>     fi
>
> Does this not mean that if we run a monitor operation that is not a
> probe we will have:
>
>     (ocf_is_probe)      return false
>     (stop != monitor)   return true
>     (false || true)     return true
>
> which will cause the if statement to return $rc and never enter the loop?
>
> Xen_Status_with_Retry() {
>     local rc cnt=5
>
>     Xen_Status $1
>     rc=$?
>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>         return $rc
>     fi
>     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>         case "$__OCF_ACTION" in
>         stop)
>             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>             ;;
>         monitor)
>             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>             ;;
>         *) : not reachable
>             ;;
>         esac
>         sleep 1
>         Xen_Status $1
>         rc=$?
>         let cnt=$((cnt-1))
>     done
>     return $rc
> }
>
>
> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>> Hi Tom,
>>
>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>> Hi Dejan
>>>
>>> Just a quick question. I cannot see your new log messages being logged
>>> to syslog:
>>>
>>>     ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>
>>> Do you know where I can set my logging to see warn-level messages? I
>>> expected to see them in my testing by default but that does not seem to
>>> be true.
>> You should see them by default. But note that these warnings may
>> not happen, depending on the circumstances on your host. In my
>> experiments they were logged only while the guest was rebooting,
>> and then just once or maybe twice. If you have recent
>> resource-agents and crmsh, you can enable operation tracing (with
>> crm resource trace <rsc> monitor <interval>).
>>
>> Thanks,
>>
>> Dejan
>>
>>> Thanks
>>>
>>> Tom
>>>
>>>
>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>> Hi,
>>>>
>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>> Hi!
>>>>>
>>>>> I thought I'd never be bitten by this bug, but I actually was! Now I'm
>>>>> wondering whether the Xen RA sees the guest if you use pygrub, and
>>>>> pygrub is still counting down for actual boot...
>>>>>
>>>>> But the reason why I'm writing is that I think I've discovered
>>>>> another bug in the RA:
>>>>>
>>>>> CRM decided to "recover" the guest VM "v02":
>>>>> [...]
>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>>> [...]
>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>> [...]
>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>>> [...]
>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>> [...]
>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>>> [...]
>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>> [...]
>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>>> [...]
>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>>> [...]
>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>>> [...]
>>>>>
>>>>> As you can clearly see, "start" failed because the guest was found
>>>>> up already! IMHO this is a bug in the RA (SLES11 SP2:
>>>>> resource-agents-3.9.4-0.26.84).
>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>> domain being gone for a while and then reappearing.
>>>>
>>>>> I guess the following test is problematic:
>>>>> ---
>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>> rc=$?
>>>>> if [ $rc -ne 0 ]; then
>>>>>     return $OCF_ERR_GENERIC
>>>>> ---
>>>>> Here "xm create" probably fails if the guest is already created...
>>>> It should fail too. Note that this is a race, but the race is
>>>> anyway caused by the strange behaviour of xen. With the recent
>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>
>>>> Thanks,
>>>>
>>>> Dejan
>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>
>>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at 12:24
>>>>>>>> in message <20131001102430.GA4687@walrus.homenet>:
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>>
>>>>>>>> Thanks for paying attention to this issue (not really a bug) as I am
>>>>>>>> sure I am not the only one with this issue. For now I have set all my
>>>>>>>> VMs to destroy so that the cluster is the only thing managing them,
>>>>>>>> but this is not super clean as I get failures in my logs that are not
>>>>>>>> really failures.
>>>>>>> It is very much a severe bug.
>>>>>>>
>>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>>> Take a look here:
>>>>>>
>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dejan
>>>>>>
>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Lars
>>>>>>>
>>>>>>> --
>>>>>>> Architect Storage/HA
>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>>> "Experience is the name everyone gives to their mistakes."
>>>>>>> -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems