On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
> Hi Tom,
>
> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
> > Some more reading of the source code makes me think the
> > " || [ "$__OCF_ACTION" != "stop" ]; " is not needed.
>
> Yes, you're right. I'll drop that part of the if statement. Many
> thanks for testing.
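[Editorial note: Tom's truth-table reasoning below can be checked in isolation. This is a minimal sketch, not the RA itself; `ocf_is_probe` is stubbed here (the real helper comes from the resource-agents shell function library), and we assume a regular, non-probe monitor operation.]

```shell
# Stub standing in for the real ocf_is_probe helper: pretend this is a
# regular monitor operation, not a probe.
ocf_is_probe() { return 1; }
__OCF_ACTION="monitor"

# The guard as it stood in the RA: for a non-probe monitor,
# ocf_is_probe is false, but "monitor" != "stop" is true, so the
# || short-circuits to true and the function returns early.
if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return -- retry loop never reached"
else
    echo "would enter the retry loop"
fi
```

Running this prints "early return -- retry loop never reached", which matches Tom's observation that a monitor operation never enters the retry loop.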
Fixed now. The if statement, which was obviously hard to follow, has been
relegated to the monitor function, which makes Xen_Status_with_Retry really
stand for what's happening in there ;-) Tom, I hope you can test again.

Cheers,

Dejan

> Cheers,
>
> Dejan
>
> > Xen_Status_with_Retry() is only called from stop and monitor, so we only
> > need to check whether it's a probe. Everything else should be handled in
> > the case statement in the loop.
> >
> > Tom
> >
> > On 10/16/2013 05:16 PM, Tom Parker wrote:
> > > Hi. I think there is an issue with the updated Xen RA.
> > >
> > > I think there is an issue with this if statement, but I am not sure.
> > > I may be confused about how bash || works, but I never see my servers
> > > entering the loop when a VM disappears.
> > >
> > > if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> > >     return $rc
> > > fi
> > >
> > > Does this not mean that if we run a monitor operation that is not a
> > > probe we will have:
> > >
> > > (ocf_is_probe) returns false
> > > (stop != monitor) returns true
> > > (false || true) returns true
> > >
> > > which will cause the if statement to return $rc and never enter the loop?
> > >
> > > Xen_Status_with_Retry() {
> > >     local rc cnt=5
> > >
> > >     Xen_Status $1
> > >     rc=$?
> > >     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> > >         return $rc
> > >     fi
> > >     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
> > >         case "$__OCF_ACTION" in
> > >         stop)
> > >             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
> > >             ;;
> > >         monitor)
> > >             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
> > >             ;;
> > >         *) : not reachable
> > >             ;;
> > >         esac
> > >         sleep 1
> > >         Xen_Status $1
> > >         rc=$?
> > >         let cnt=$((cnt-1))
> > >     done
> > >     return $rc
> > > }
> > >
> > >
> > > On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
> > >> Hi Tom,
> > >>
> > >> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
> > >>> Hi Dejan
> > >>>
> > >>> Just a quick question. I cannot see your new log messages being logged
> > >>> to syslog:
> > >>>
> > >>> ocf_log warn "domain $1 reported as not running, but it is expected to
> > >>> be running! Retrying for $cnt seconds ..."
> > >>>
> > >>> Do you know where I can set my logging to see warn-level messages? I
> > >>> expected to see them in my testing by default, but that does not seem
> > >>> to be the case.
> > >> You should see them by default. But note that these warnings may
> > >> not happen, depending on the circumstances on your host. In my
> > >> experiments they were logged only while the guest was rebooting,
> > >> and then just once or maybe twice. If you have recent
> > >> resource-agents and crmsh, you can enable operation tracing (with
> > >> crm resource trace <rsc> monitor <interval>).
> > >>
> > >> Thanks,
> > >>
> > >> Dejan
> > >>
> > >>> Thanks
> > >>>
> > >>> Tom
> > >>>
> > >>>
> > >>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
> > >>>>> Hi!
> > >>>>>
> > >>>>> I thought I'd never be bitten by this bug, but I actually was! Now
> > >>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub,
> > >>>>> and pygrub is still counting down before the actual boot...
> > >>>>>
> > >>>>> But the reason I'm writing is that I think I've discovered another
> > >>>>> bug in the RA:
> > >>>>>
> > >>>>> CRM decided to "recover" the guest VM "v02":
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client
> > >>>>> 14906: pid 19516 exited with return code 7
> > >>>>> [...]
> > >>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
> > >>>>> prm_xen_v02_stop_0 on h05 (local)
> > >>>>> [...]
> > >>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client
> > >>>>> 14906: pid 19552 exited with return code 0
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
> > >>>>> prm_xen_v02_start_0 on h05 (local)
> > >>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error:
> > >>>>> Domain 'v02' already exists with ID '3'
> > >>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using
> > >>>>> config file "/etc/xen/vm/v02".
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client
> > >>>>> 14906: pid 19686 exited with return code 1
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: process_lrm_event: LRM operation
> > >>>>> prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true)
> > >>>>> unknown error
> > >>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0)
> > >>>>> on h05 failed (target: 0 vs. rc: 1): Error
> > >>>>> [...]
> > >>>>>
> > >>>>> As you can clearly see, "start" failed because the guest was found
> > >>>>> to be up already! IMHO this is a bug in the RA (SLES11 SP2:
> > >>>>> resource-agents-3.9.4-0.26.84).
> > >>>> Yes, I've seen that. It's basically the same issue, i.e. the
> > >>>> domain being gone for a while and then reappearing.
> > >>>>
> > >>>>> I guess the following test is problematic:
> > >>>>> ---
> > >>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> > >>>>> rc=$?
> > >>>>> if [ $rc -ne 0 ]; then
> > >>>>>     return $OCF_ERR_GENERIC
> > >>>>> ---
> > >>>>> Here "xm create" probably fails if the guest is already created...
> > >>>> It should fail, too. Note that this is a race, but the race is
> > >>>> caused by the strange behaviour of Xen anyway. With the recent
> > >>>> fix (or workaround) in the RA, this shouldn't be happening.
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Dejan
> > >>>>
> > >>>>> Regards,
> > >>>>> Ulrich
> > >>>>>
> > >>>>>
> > >>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at
> > >>>>>>>> 12:24 in message <20131001102430.GA4687@walrus.homenet>:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
> > >>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
> > >>>>>>>
> > >>>>>>>> Thanks for paying attention to this issue (not really a bug), as
> > >>>>>>>> I am sure I am not the only one with this issue. For now I have
> > >>>>>>>> set all my VMs to destroy so that the cluster is the only thing
> > >>>>>>>> managing them, but this is not super clean, as I get failures in
> > >>>>>>>> my logs that are not really failures.
> > >>>>>>> It is very much a severe bug.
> > >>>>>>>
> > >>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
> > >>>>>> Take a look here:
> > >>>>>>
> > >>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Dejan
> > >>>>>>
> > >>>>>>> the Xen team (where the real problem is) to investigate and fix.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Lars
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Architect Storage/HA
> > >>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> > >>>>>>> Imendörffer, HRB 21284 (AG Nürnberg)
> > >>>>>>> "Experience is the name everyone gives to their mistakes."
> > >>>>>>> -- Oscar Wilde
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Linux-HA mailing list
> > >>>>>>> Linux-HA@lists.linux-ha.org
> > >>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > >>>>>>> See also: http://linux-ha.org/ReportingProblems
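[Editorial note: one conventional way to harden the start path against the "Domain 'v02' already exists" failure seen in Ulrich's logs is to make start idempotent: re-check the domain state and treat "already running" as success. This is a hypothetical sketch, not the actual RA code; ocf_log, Xen_Status, and xm are stand-in stubs so the example is self-contained.]

```shell
# Stubs for illustration only; in the real RA these come from
# ocf-shellfuncs and the Xen toolstack.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
ocf_log() { echo "$1: $2" >&2; }
Xen_Status() { [ "$FAKE_RUNNING" = "yes" ]; }  # pretend hypervisor query
xm() { FAKE_RUNNING=yes; }                     # pretend "xm create" succeeds

Xen_Start() {
    # Idempotent start: if the domain already exists, report success
    # instead of letting "xm create" fail with "already exists".
    if Xen_Status "$DOMAIN_NAME"; then
        ocf_log info "Xen domain $DOMAIN_NAME already running."
        return $OCF_SUCCESS
    fi
    xm create "${OCF_RESKEY_xmfile}" name="$DOMAIN_NAME" ||
        return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

DOMAIN_NAME=v02
FAKE_RUNNING=yes
Xen_Start    # logs "already running" and returns OCF_SUCCESS
```

This does not remove the underlying race (the domain can reappear between the check and the create, as Dejan notes), but it turns the common "already running" case into a clean success.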
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
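[Editorial note: for readers following along, the shape of the fix Dejan describes at the top of the thread (the probe/stop guard relegated to the monitor function, so the retry helper always retries) might look roughly like this. A sketch under stated assumptions, not the actual patch; Xen_Status, ocf_is_probe, and ocf_log are stubs here.]

```shell
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Stubs for illustration only.
Xen_Status() { return "${FAKE_STATUS:-0}"; }   # pretend hypervisor query
ocf_is_probe() { return 1; }                   # pretend: a regular (non-probe) op
ocf_log() { echo "$1: $2" >&2; }

# With the guard gone, the helper retries unconditionally whenever the
# domain is reported as not running.
Xen_Status_with_Retry() {
    local rc cnt=5
    Xen_Status "$1"; rc=$?
    while [ $rc -eq $OCF_NOT_RUNNING ] && [ $cnt -gt 0 ]; do
        ocf_log warn "domain $1 reported as not running, retrying for $cnt seconds ..."
        sleep 1
        Xen_Status "$1"; rc=$?
        cnt=$((cnt-1))
    done
    return $rc
}

# The caller now decides: a probe takes the first answer at face value,
# while a regular monitor tolerates a transient "not running" blip.
Xen_Monitor() {
    if ocf_is_probe; then
        Xen_Status "$1"
    else
        Xen_Status_with_Retry "$1"
    fi
}
```

The point of the restructuring is that Xen_Status_with_Retry now does exactly what its name says, and the probe/non-probe decision lives where the policy belongs, in the monitor action.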