Hi Tom,

On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
> Hi Dejan
> 
> Just a quick question.  I cannot see your new log messages being logged
> to syslog:
> 
> ocf_log warn "domain $1 reported as not running, but it is expected to
> be running! Retrying for $cnt seconds ..."
> 
> Do you know where I can set my logging level to see warn-level
> messages?  I expected to see them by default in my testing, but that
> does not seem to be the case.

You should see them by default. Note, however, that these warnings
may not appear at all, depending on the circumstances on your host:
in my experiments they were logged only while the guest was
rebooting, and then just once or twice. If you have a recent
resource-agents and crmsh, you can also enable operation tracing
(with crm resource trace <rsc> monitor <interval>).
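
For example, something like this (a rough sketch; the resource name
and the monitor interval are placeholders here, use the ones from
your own configuration):

  # trace every run of the 30s monitor of the Xen resource
  crm resource trace prm_xen_v02 monitor 30
  # ... reproduce the problem, then inspect the trace files, which
  # should end up under the resource-agents trace directory
  # (typically /var/lib/heartbeat/trace_ra/), and switch tracing off
  crm resource untrace prm_xen_v02 monitor

The warnings themselves are emitted via ocf_log, so they should end
up wherever your cluster logging (syslog or logd) is directed.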

Thanks,

Dejan

> Thanks
> 
> Tom
> 
> 
> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
> >> Hi!
> >>
> >> I thought I'd never be bitten by this bug, but I actually was! Now I'm
> >> wondering whether the Xen RA sees the guest if you use pygrub and pygrub
> >> is still counting down before the actual boot...
> >>
> >> But the reason I'm writing is that I think I've discovered another bug
> >> in the RA:
> >>
> >> CRM decided to "recover" the guest VM "v02":
> >> [...]
> >> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
> >> pid 19516 exited with return code 7
> >> [...]
> >>  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
> >> [...]
> >>  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
> >> prm_xen_v02_stop_0 on h05 (local)
> >> [...]
> >> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
> >> [...]
> >> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906:
> >> pid 19552 exited with return code 0
> >> [...]
> >> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
> >> prm_xen_v02_start_0 on h05 (local)
> >> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
> >> [...]
> >> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain
> >> 'v02' already exists with ID '3'
> >> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config
> >> file "/etc/xen/vm/v02".
> >> [...]
> >> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906:
> >> pid 19686 exited with return code 1
> >> [...]
> >> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
> >> (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
> >> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
> >> failed (target: 0 vs. rc: 1): Error
> >> [...]
> >>
> >> As you can clearly see, "start" failed because the guest was already up!
> >> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
> > Yes, I've seen that. It's basically the same issue, i.e. the
> > domain being gone for a while and then reappearing.
> >
> >> I guess the following test is problematic:
> >> ---
> >>   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> >>   rc=$?
> >>   if [ $rc -ne 0 ]; then
> >>     return $OCF_ERR_GENERIC
> >> ---
> >> Here "xm create" probably fails if the guest is already created...
> > It should fail too. Note that this is a race, but the race is
> > caused by the strange behaviour of xen in the first place. With the
> > recent fix (or workaround) in the RA, this shouldn't be happening.
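> >
> > One way to make the start action robust against this (a rough
> > sketch, not what the RA actually does) would be to re-check the
> > domain when xm create fails:
> >
> >   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> >   rc=$?
> >   if [ $rc -ne 0 ]; then
> >     # sketch only: if create failed because the domain came back on
> >     # its own in the meantime, don't report a start failure
> >     if xm list "$DOMAIN_NAME" >/dev/null 2>&1; then
> >       ocf_log info "Xen domain $DOMAIN_NAME is already running."
> >       return $OCF_SUCCESS
> >     fi
> >     return $OCF_ERR_GENERIC
> >   fi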
> >
> > Thanks,
> >
> > Dejan
> >
> >> Regards,
> >> Ulrich
> >>
> >>
> >>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at 12:24
> >>>>> in message <20131001102430.GA4687@walrus.homenet>:
> >>> Hi,
> >>>
> >>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
> >>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
> >>>>
> >>>>> Thanks for paying attention to this issue (not really a bug), as I am
> >>>>> sure I am not the only one affected by it.  For now I have set all my
> >>>>> VMs to destroy, so that the cluster is the only thing managing them,
> >>>>> but this is not super clean: I get failures in my logs that are not
> >>>>> really failures.
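> >>>>>
> >>>>> By "set to destroy" I mean something along these lines in each domU
> >>>>> config file (e.g. under /etc/xen/vm/), so that xend never restarts a
> >>>>> guest on its own and leaves that entirely to the cluster:
> >>>>>
> >>>>>   on_poweroff="destroy"
> >>>>>   on_reboot="destroy"
> >>>>>   on_crash="destroy"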
> >>>> It is very much a severe bug.
> >>>>
> >>>> The Xen RA has gained a workaround for this now, but we're also pushing
> >>>> the Xen team (where the real problem is) to investigate and fix.
> >>> Take a look here:
> >>>
> >>> https://github.com/ClusterLabs/resource-agents/pull/314
> >>>
> >>> Thanks,
> >>>
> >>> Dejan
> >>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>     Lars
> >>>>
> >>>> -- 
> >>>> Architect Storage/HA
> >>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> >>>> HRB 21284 (AG Nürnberg)
> >>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> >>>>
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
