On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
> Hi Tom,
>
> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
> > Some more reading of the source code makes me think the
> > " || [ "$__OCF_ACTION" != "stop" ]; " is not needed.
>
> Yes, you're right. I'll drop that part of the if statement. Many
> thanks for testing.
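[Editorial note: Tom's truth-table reasoning below can be checked in isolation. This is a minimal sketch, not the RA itself; `ocf_is_probe` is stubbed here (the real helper comes from the resource-agents shell function library), and we assume a regular, non-probe monitor operation.]

```shell
# Stub standing in for the real ocf_is_probe helper: pretend this is a
# regular monitor operation, not a probe.
ocf_is_probe() { return 1; }
__OCF_ACTION="monitor"

# The guard as it stood in the RA: for a non-probe monitor,
# ocf_is_probe is false, but "monitor" != "stop" is true, so the
# || short-circuits to true and the function returns early.
if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return -- retry loop never reached"
else
    echo "would enter the retry loop"
fi
```

Running this prints "early return -- retry loop never reached", which matches Tom's observation that a monitor operation never enters the retry loop.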
Fixed now. The if statement, which was obviously hard to follow, has been
relegated to the monitor function, which makes Xen_Status_with_Retry really
stand for what's happening in there ;-) Tom, I hope you can test again.

Cheers,

Dejan

> Cheers,
>
> Dejan
>
> > Xen_Status_with_Retry() is only called from stop and monitor, so we only
> > need to check whether it's a probe. Everything else should be handled in
> > the case statement in the loop.
> >
> > Tom
> >
> > On 10/16/2013 05:16 PM, Tom Parker wrote:
> > > Hi. I think there is an issue with the updated Xen RA.
> > >
> > > I think there is an issue with this if statement, but I am not sure.
> > > I may be confused about how bash || works, but I never see my servers
> > > entering the loop when a VM disappears.
> > >
> > > if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> > >     return $rc
> > > fi
> > >
> > > Does this not mean that if we run a monitor operation that is not a
> > > probe we will have:
> > >
> > > (ocf_is_probe) returns false
> > > (stop != monitor) returns true
> > > (false || true) returns true
> > >
> > > which will cause the if statement to return $rc and never enter the loop?
> > >
> > > Xen_Status_with_Retry() {
> > >     local rc cnt=5
> > >
> > >     Xen_Status $1
> > >     rc=$?
> > >     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> > >         return $rc
> > >     fi
> > >     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
> > >         case "$__OCF_ACTION" in
> > >         stop)
> > >             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
> > >             ;;
> > >         monitor)
> > >             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
> > >             ;;
> > >         *) : not reachable
> > >             ;;
> > >         esac
> > >         sleep 1
> > >         Xen_Status $1
> > >         rc=$?
> > >         let cnt=$((cnt-1))
> > >     done
> > >     return $rc
> > > }
> > >
> > >
> > > On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
> > >> Hi Tom,
> > >>
> > >> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
> > >>> Hi Dejan
> > >>>
> > >>> Just a quick question. I cannot see your new log messages being logged
> > >>> to syslog:
> > >>>
> > >>> ocf_log warn "domain $1 reported as not running, but it is expected to
> > >>> be running! Retrying for $cnt seconds ..."
> > >>>
> > >>> Do you know where I can set my logging to see warn-level messages? I
> > >>> expected to see them in my testing by default, but that does not seem
> > >>> to be the case.
> > >> You should see them by default. But note that these warnings may
> > >> not happen, depending on the circumstances on your host. In my
> > >> experiments they were logged only while the guest was rebooting,
> > >> and then just once or maybe twice. If you have recent
> > >> resource-agents and crmsh, you can enable operation tracing (with
> > >> crm resource trace <rsc> monitor <interval>).
> > >>
> > >> Thanks,
> > >>
> > >> Dejan
> > >>
> > >>> Thanks
> > >>>
> > >>> Tom
> > >>>
> > >>>
> > >>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
> > >>>>> Hi!
> > >>>>>
> > >>>>> I thought I'd never be bitten by this bug, but I actually was! Now
> > >>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub,
> > >>>>> and pygrub is still counting down before the actual boot...
> > >>>>>
> > >>>>> But the reason I'm writing is that I think I've discovered another
> > >>>>> bug in the RA:
> > >>>>>
> > >>>>> CRM decided to "recover" the guest VM "v02":
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client
> > >>>>> 14906: pid 19516 exited with return code 7
> > >>>>> [...]
> > >>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
> > >>>>> prm_xen_v02_stop_0 on h05 (local)
> > >>>>> [...]
> > >>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client
> > >>>>> 14906: pid 19552 exited with return code 0
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
> > >>>>> prm_xen_v02_start_0 on h05 (local)
> > >>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error:
> > >>>>> Domain 'v02' already exists with ID '3'
> > >>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using
> > >>>>> config file "/etc/xen/vm/v02".
> > >>>>> [...]
> > >>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client
> > >>>>> 14906: pid 19686 exited with return code 1
> > >>>>> [...]
> > >>>>> crmd: [14906]: info: process_lrm_event: LRM operation
> > >>>>> prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true)
> > >>>>> unknown error
> > >>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0)
> > >>>>> on h05 failed (target: 0 vs. rc: 1): Error
> > >>>>> [...]
> > >>>>>
> > >>>>> As you can clearly see, "start" failed because the guest was found
> > >>>>> to be up already! IMHO this is a bug in the RA (SLES11 SP2:
> > >>>>> resource-agents-3.9.4-0.26.84).
> > >>>> Yes, I've seen that. It's basically the same issue, i.e. the
> > >>>> domain being gone for a while and then reappearing.
> > >>>>
> > >>>>> I guess the following test is problematic:
> > >>>>> ---
> > >>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> > >>>>> rc=$?
> > >>>>> if [ $rc -ne 0 ]; then
> > >>>>>     return $OCF_ERR_GENERIC
> > >>>>> ---
> > >>>>> Here "xm create" probably fails if the guest is already created...
> > >>>> It should fail, too. Note that this is a race, but the race is
> > >>>> caused by the strange behaviour of Xen anyway. With the recent
> > >>>> fix (or workaround) in the RA, this shouldn't be happening.
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Dejan
> > >>>>
> > >>>>> Regards,
> > >>>>> Ulrich
> > >>>>>
> > >>>>>
> > >>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at
> > >>>>>>>> 12:24 in message <20131001102430.GA4687@walrus.homenet>:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
> > >>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
> > >>>>>>>
> > >>>>>>>> Thanks for paying attention to this issue (not really a bug), as
> > >>>>>>>> I am sure I am not the only one with this issue. For now I have
> > >>>>>>>> set all my VMs to destroy so that the cluster is the only thing
> > >>>>>>>> managing them, but this is not super clean, as I get failures in
> > >>>>>>>> my logs that are not really failures.
> > >>>>>>> It is very much a severe bug.
> > >>>>>>>
> > >>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
> > >>>>>> Take a look here:
> > >>>>>>
> > >>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Dejan
> > >>>>>>
> > >>>>>>> the Xen team (where the real problem is) to investigate and fix.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Lars
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Architect Storage/HA
> > >>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> > >>>>>>> Imendörffer, HRB 21284 (AG Nürnberg)
> > >>>>>>> "Experience is the name everyone gives to their mistakes."
> > >>>>>>> -- Oscar Wilde
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Linux-HA mailing list
> > >>>>>>> Linux-HA@lists.linux-ha.org
> > >>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > >>>>>>> See also: http://linux-ha.org/ReportingProblems
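[Editorial note: one conventional way to harden the start path against the "Domain 'v02' already exists" failure seen in Ulrich's logs is to make start idempotent: re-check the domain state and treat "already running" as success. This is a hypothetical sketch, not the actual RA code; ocf_log, Xen_Status, and xm are stand-in stubs so the example is self-contained.]

```shell
# Stubs for illustration only; in the real RA these come from
# ocf-shellfuncs and the Xen toolstack.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
ocf_log() { echo "$1: $2" >&2; }
Xen_Status() { [ "$FAKE_RUNNING" = "yes" ]; }  # pretend hypervisor query
xm() { FAKE_RUNNING=yes; }                     # pretend "xm create" succeeds

Xen_Start() {
    # Idempotent start: if the domain already exists, report success
    # instead of letting "xm create" fail with "already exists".
    if Xen_Status "$DOMAIN_NAME"; then
        ocf_log info "Xen domain $DOMAIN_NAME already running."
        return $OCF_SUCCESS
    fi
    xm create "${OCF_RESKEY_xmfile}" name="$DOMAIN_NAME" ||
        return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

DOMAIN_NAME=v02
FAKE_RUNNING=yes
Xen_Start    # logs "already running" and returns OCF_SUCCESS
```

This does not remove the underlying race (the domain can reappear between the check and the create, as Dejan notes), but it turns the common "already running" case into a clean success.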
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
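[Editorial note: for readers following along, the shape of the fix Dejan describes at the top of the thread (the probe/stop guard relegated to the monitor function, so the retry helper always retries) might look roughly like this. A sketch under stated assumptions, not the actual patch; Xen_Status, ocf_is_probe, and ocf_log are stubs here.]

```shell
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Stubs for illustration only.
Xen_Status() { return "${FAKE_STATUS:-0}"; }   # pretend hypervisor query
ocf_is_probe() { return 1; }                   # pretend: a regular (non-probe) op
ocf_log() { echo "$1: $2" >&2; }

# With the guard gone, the helper retries unconditionally whenever the
# domain is reported as not running.
Xen_Status_with_Retry() {
    local rc cnt=5
    Xen_Status "$1"; rc=$?
    while [ $rc -eq $OCF_NOT_RUNNING ] && [ $cnt -gt 0 ]; do
        ocf_log warn "domain $1 reported as not running, retrying for $cnt seconds ..."
        sleep 1
        Xen_Status "$1"; rc=$?
        cnt=$((cnt-1))
    done
    return $rc
}

# The caller now decides: a probe takes the first answer at face value,
# while a regular monitor tolerates a transient "not running" blip.
Xen_Monitor() {
    if ocf_is_probe; then
        Xen_Status "$1"
    else
        Xen_Status_with_Retry "$1"
    fi
}
```

The point of the restructuring is that Xen_Status_with_Retry now does exactly what its name says, and the probe/non-probe decision lives where the policy belongs, in the monitor action.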