Hi Dejan. How can I revert my commits so that they do not include multiple things? I will submit one patch with the logging cleanup and then, if needed, another with my changes to the meta-data.
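For reference, one common git recipe for splitting mixed work into separate patches (a sketch only, not project-specific: it assumes the changes live in local commits on a branch based on upstream master, and the branch name and commit messages below are illustrative):

    # Sketch: split mixed local commits into two clean patches.
    # Assumes the work sits on branch "my-work" on top of upstream "master".
    git checkout my-work
    git reset master         # drop the local commits, keep all changes unstaged
    git add -p               # interactively stage only the logging-cleanup hunks
    git commit -m "Xen: make log messages consistent"
    git add -A               # stage what remains (the meta-data changes)
    git commit -m "Xen: update meta-data"
    git format-patch master  # writes one patch file per commit

Each resulting patch file can then be submitted separately.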
Tom

On 10/21/2013 09:39 AM, Dejan Muhamedagic wrote:
> Hi Ulrich!
>
> On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
>> Hi!
>>
>> Basically I think there should be no hard-coded constants whose value depends
>> on some performance measurements, like 5s for rebooting a VM.
> It's actually not 5s, but the status is run 5 times. If the load
> is high, my guess is that the Xen tools used by the RA would
> suffer proportionally.
>
>> So I support
>> Tom's changes.
>>
>> However I noticed:
>>
>> +running; apparently, this period lasts only for a second or
>> +two
>>
>> (missing full stop at end of sentence)
> That's at the end of the comment and, typically, comments end
> with a carriage return (as is here the case).
>
>> Actually I'd rephrase the description:
>>
>> "When the guest is rebooting, there is a short interval where the guest
>> completely disappears from "xm list", which, in turn, will cause the monitor
>> operation to return a "not running" status. If the guest cannot be found, this
>> value will cause some extra delay in the monitor operation to work around the
>> problem."
>>
>> (I.e. try to describe the effect, not the implementation)
> That's the code, so the implementation is described. The very
> top of the comment says:
>
> # If the guest is rebooting, it may completely disappear from the
> # list of defined guests
>
> I was hoping that that was enough of an explanation. Look for
> a more thorough description of the cause in the changelog. BTW,
> note that this is a _workaround_ and that the thing should
> eventually be fixed in Xen.
>
>> And yes, I appreciate consistent log formats also ;-)
> That's always welcome, of course. It should also go in a
> separate commit.
>
> Thanks,
>
> Dejan
>
>> Regards,
>> Ulrich
>>
>>>>> Tom Parker <tpar...@cbnco.com> schrieb am 18.10.2013 um 19:30 in
>> Nachricht <5261703a.5070...@cbnco.com>:
>>> Hi Dejan. Sorry to be slow to respond to this. I have done some
>>> testing and everything looks good.
>>>
>>> I spent some time tweaking the RA and I added a parameter called
>>> wait_for_reboot (default 5s) to allow us to override the reboot sleep
>>> times (in case it's more than 5 seconds on really loaded hypervisors).
>>> I also cleaned up a few log entries to make them consistent in the RA
>>> and edited your entries for xen status to be a little bit more clear as
>>> to why we think we should be waiting.
>>>
>>> I have attached a patch here because I have NO idea how to create a
>>> branch and pull request. If there are links to a good place to start I
>>> may be able to contribute occasionally to some other RAs that I use.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks for your help
>>>
>>> Tom
>>>
>>>
>>> On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
>>>> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>>>>> Hi Tom,
>>>>>
>>>>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>>>>> Some more reading of the source code makes me think the
>>>>>> " || [ "$__OCF_ACTION" != "stop" ]; " is not needed.
>>>>> Yes, you're right. I'll drop that part of the if statement. Many
>>>>> thanks for testing.
>>>> Fixed now. The if statement, which was obviously hard to follow,
>>>> got relegated to the monitor function. Which makes the
>>>> Xen_Status_with_Retry really stand for what's happening in there ;-)
>>>>
>>>> Tom, hope you can test again.
>>>>
>>>> Cheers,
>>>>
>>>> Dejan
>>>>
>>>>> Cheers,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
>>>>>> need to check if it's a probe. Everything else should be handled in the
>>>>>> case statement in the loop.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>>>>> Hi. I think there is an issue with the updated Xen RA.
>>>>>>>
>>>>>>> I think there is an issue with the if statement here but I am not sure.
>>>>>>> I may be confused about how bash || works but I don't see my servers
>>>>>>> ever entering the loop on a vm disappearing.
>>>>>>>
>>>>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>     return $rc
>>>>>>> fi
>>>>>>>
>>>>>>> Does this not mean that if we run a monitor operation that is not a
>>>>>>> probe we will have:
>>>>>>>
>>>>>>> (ocf_is_probe) return false
>>>>>>> (stop != monitor) return true
>>>>>>> (false || true) return true
>>>>>>>
>>>>>>> which will cause the if statement to return $rc and never enter the loop?
>>>>>>>
>>>>>>> Xen_Status_with_Retry() {
>>>>>>>     local rc cnt=5
>>>>>>>
>>>>>>>     Xen_Status $1
>>>>>>>     rc=$?
>>>>>>>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>         return $rc
>>>>>>>     fi
>>>>>>>     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>>>>>         case "$__OCF_ACTION" in
>>>>>>>         stop)
>>>>>>>             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>>>>>>>             ;;
>>>>>>>         monitor)
>>>>>>>             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>>>>>             ;;
>>>>>>>         *) : not reachable
>>>>>>>             ;;
>>>>>>>         esac
>>>>>>>         sleep 1
>>>>>>>         Xen_Status $1
>>>>>>>         rc=$?
>>>>>>>         let cnt=$((cnt-1))
>>>>>>>     done
>>>>>>>     return $rc
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>>>>> Hi Tom,
>>>>>>>>
>>>>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>>>>> Hi Dejan
>>>>>>>>>
>>>>>>>>> Just a quick question. I cannot see your new log messages being logged
>>>>>>>>> to syslog:
>>>>>>>>>
>>>>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected to
>>>>>>>>> be running! Retrying for $cnt seconds ..."
>>>>>>>>>
>>>>>>>>> Do you know where I can set my logging to see warn level messages? I
>>>>>>>>> expected to see them in my testing by default but that does not seem to
>>>>>>>>> be true.
>>>>>>>> You should see them by default. But note that these warnings may
>>>>>>>> not happen, depending on the circumstances on your host. In my
>>>>>>>> experiments they were logged only while the guest was rebooting
>>>>>>>> and then just once or maybe twice. If you have recent
>>>>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dejan
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> I thought, I'll never be bitten by this bug, but I actually was! Now I'm
>>>>>>>>>>> wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
>>>>>>>>>>> still counting down for actual boot...
>>>>>>>>>>>
>>>>>>>>>>> But the reason why I'm writing is that I think I've discovered another bug in
>>>>>>>>>>> the RA:
>>>>>>>>>>>
>>>>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
>>>>>>>>>>> pid 19516 exited with return code 7
>>>>>>>>>>> [...]
>>>>>>>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
>>>>>>>>>>> prm_xen_v02_stop_0 on h05 (local)
>>>>>>>>>>> [...]
>>>>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
>>>>>>>>>>> 19552 exited with return code 0
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
>>>>>>>>>>> prm_xen_v02_start_0 on h05 (local)
>>>>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02'
>>>>>>>>>>> already exists with ID '3'
>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
>>>>>>>>>>> "/etc/xen/vm/v02".
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
>>>>>>>>>>> 19686 exited with return code 1
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
>>>>>>>>>>> (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
>>>>>>>>>>> failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>> As you can clearly see, "start" failed because the guest was found up already!
>>>>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>>>>
>>>>>>>>>>> I guess the following test is problematic:
>>>>>>>>>>> ---
>>>>>>>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>>>>> rc=$?
>>>>>>>>>>> if [ $rc -ne 0 ]; then
>>>>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>>>>> ---
>>>>>>>>>>> Here "xm create" probably fails if the guest is already created...
>>>>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dejan
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ulrich
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> schrieb am 01.10.2013 um 12:24 in
>>>>>>>>>>> Nachricht <20131001102430.GA4687@walrus.homenet>:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug) as I am
>>>>>>>>>>>>>> sure I am not the only one with this issue.
>>>>>>>>>>>>>> For now I have set all my
>>>>>>>>>>>>>> VMs to destroy so that the cluster is the only thing managing them, but
>>>>>>>>>>>>>> this is not super clean as I get failures in my logs that are not really
>>>>>>>>>>>>>> failures.
>>>>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>>>>>> Take a look here:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dejan
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Lars
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Architect Storage/HA
>>>>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
>>>>>>>>>>>>> HRB 21284 (AG Nürnberg)
>>>>>>>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
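For reference, the short-circuit behaviour Tom walks through in the quoted thread above can be reproduced outside the cluster with a small stand-alone sketch (ocf_is_probe is stubbed here; this is not the RA code itself):

    #!/bin/sh
    # Stand-alone sketch of the guard discussed above; ocf_is_probe is
    # stubbed so the script runs outside the cluster.
    __OCF_ACTION=monitor
    ocf_is_probe() { return 1; }   # pretend this is a regular monitor, not a probe

    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        echo "guard taken: return early, the retry loop is skipped"
    else
        echo "guard not taken: the retry loop would run"
    fi
    # prints: guard taken: return early, the retry loop is skipped

This matches Tom's truth table: false || true is true, so a non-probe monitor returns before the retry loop, which is why the guard was subsequently moved into the monitor function.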
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems