Hi Dejan. How can I revert my commits so that they do not include multiple things? I will submit one patch with the logging cleanup and then, if needed, another with my changes to the meta-data.
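For reference, one common git recipe for splitting mixed work into separate patches (a sketch only, not project-specific: it assumes the changes live in local commits on a branch based on upstream master, and the branch name and commit messages below are illustrative):

    # Sketch: split mixed local commits into two clean patches.
    # Assumes the work sits on branch "my-work" on top of upstream "master".
    git checkout my-work
    git reset master         # drop the local commits, keep all changes unstaged
    git add -p               # interactively stage only the logging-cleanup hunks
    git commit -m "Xen: make log messages consistent"
    git add -A               # stage what remains (the meta-data changes)
    git commit -m "Xen: update meta-data"
    git format-patch master  # writes one patch file per commit

Each resulting patch file can then be submitted separately.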
Tom

On 10/21/2013 09:39 AM, Dejan Muhamedagic wrote:
> Hi Ulrich!
>
> On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
>> Hi!
>>
>> Basically I think there should be no hard-coded constants whose value depends
>> on some performance measurements, like 5s for rebooting a VM.
> It's actually not 5s, but the status is run 5 times. If the load
> is high, my guess is that the Xen tools used by the RA would
> suffer proportionally.
>
>> So I support
>> Tom's changes.
>>
>> However I noticed:
>>
>> +running; apparently, this period lasts only for a second or
>> +two
>>
>> (missing full stop at end of sentence)
> That's at the end of the comment and, typically, comments end
> with a carriage return (as is here the case).
>
>> Actually I'd rephrase the description:
>>
>> "When the guest is rebooting, there is a short interval where the guest
>> completely disappears from "xm list", which, in turn, will cause the monitor
>> operation to return a "not running" status. If the guest cannot be found, this
>> value will cause some extra delay in the monitor operation to work around the
>> problem."
>>
>> (I.e. try to describe the effect, not the implementation)
> That's the code, so the implementation is described. The very
> top of the comment says:
>
> # If the guest is rebooting, it may completely disappear from the
> # list of defined guests
>
> I was hoping that that was enough of an explanation. Look for
> a more thorough description of the cause in the changelog. BTW,
> note that this is a _workaround_ and that the thing should
> eventually be fixed in Xen.
>
>> And yes, I appreciate consistent log formats also ;-)
> That's always welcome, of course. It should also go in a
> separate commit.
>
> Thanks,
>
> Dejan
>
>> Regards,
>> Ulrich
>>
>>>>> Tom Parker <tpar...@cbnco.com> schrieb am 18.10.2013 um 19:30 in
>> Nachricht <5261703a.5070...@cbnco.com>:
>>> Hi Dejan. Sorry to be slow to respond to this. I have done some
>>> testing and everything looks good.
>>>
>>> I spent some time tweaking the RA and I added a parameter called
>>> wait_for_reboot (default 5s) to allow us to override the reboot sleep
>>> times (in case it's more than 5 seconds on really loaded hypervisors).
>>> I also cleaned up a few log entries to make them consistent in the RA
>>> and edited your entries for xen status to be a little bit more clear as
>>> to why we think we should be waiting.
>>>
>>> I have attached a patch here because I have NO idea how to create a
>>> branch and pull request. If there are links to a good place to start I
>>> may be able to contribute occasionally to some other RAs that I use.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks for your help
>>>
>>> Tom
>>>
>>>
>>> On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
>>>> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>>>>> Hi Tom,
>>>>>
>>>>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>>>>> Some more reading of the source code makes me think the
>>>>>> " || [ "$__OCF_ACTION" != "stop" ]; " is not needed.
>>>>> Yes, you're right. I'll drop that part of the if statement. Many
>>>>> thanks for testing.
>>>> Fixed now. The if statement, which was obviously hard to follow,
>>>> got relegated to the monitor function. Which makes the
>>>> Xen_Status_with_Retry really stand for what's happening in there ;-)
>>>>
>>>> Tom, hope you can test again.
>>>>
>>>> Cheers,
>>>>
>>>> Dejan
>>>>
>>>>> Cheers,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
>>>>>> need to check if it's a probe. Everything else should be handled in the
>>>>>> case statement in the loop.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>>>>> Hi. I think there is an issue with the updated Xen RA.
>>>>>>>
>>>>>>> I think there is an issue with the if statement here but I am not sure.
>>>>>>> I may be confused about how bash || works but I don't see my servers
>>>>>>> ever entering the loop on a vm disappearing.
>>>>>>>
>>>>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>     return $rc
>>>>>>> fi
>>>>>>>
>>>>>>> Does this not mean that if we run a monitor operation that is not a
>>>>>>> probe we will have:
>>>>>>>
>>>>>>> (ocf_is_probe) return false
>>>>>>> (stop != monitor) return true
>>>>>>> (false || true) return true
>>>>>>>
>>>>>>> which will cause the if statement to return $rc and never enter the loop?
>>>>>>>
>>>>>>> Xen_Status_with_Retry() {
>>>>>>>     local rc cnt=5
>>>>>>>
>>>>>>>     Xen_Status $1
>>>>>>>     rc=$?
>>>>>>>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>         return $rc
>>>>>>>     fi
>>>>>>>     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>>>>>         case "$__OCF_ACTION" in
>>>>>>>         stop)
>>>>>>>             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>>>>>>>             ;;
>>>>>>>         monitor)
>>>>>>>             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>>>>>             ;;
>>>>>>>         *) : not reachable
>>>>>>>             ;;
>>>>>>>         esac
>>>>>>>         sleep 1
>>>>>>>         Xen_Status $1
>>>>>>>         rc=$?
>>>>>>>         let cnt=$((cnt-1))
>>>>>>>     done
>>>>>>>     return $rc
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>>>>> Hi Tom,
>>>>>>>>
>>>>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>>>>> Hi Dejan
>>>>>>>>>
>>>>>>>>> Just a quick question. I cannot see your new log messages being logged
>>>>>>>>> to syslog:
>>>>>>>>>
>>>>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected to
>>>>>>>>> be running! Retrying for $cnt seconds ..."
>>>>>>>>>
>>>>>>>>> Do you know where I can set my logging to see warn level messages? I
>>>>>>>>> expected to see them in my testing by default but that does not seem to
>>>>>>>>> be true.
>>>>>>>> You should see them by default. But note that these warnings may
>>>>>>>> not happen, depending on the circumstances on your host. In my
>>>>>>>> experiments they were logged only while the guest was rebooting
>>>>>>>> and then just once or maybe twice. If you have recent
>>>>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dejan
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> I thought, I'll never be bitten by this bug, but I actually was! Now I'm
>>>>>>>>>>> wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
>>>>>>>>>>> still counting down for actual boot...
>>>>>>>>>>>
>>>>>>>>>>> But the reason why I'm writing is that I think I've discovered another bug in
>>>>>>>>>>> the RA:
>>>>>>>>>>>
>>>>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
>>>>>>>>>>> pid 19516 exited with return code 7
>>>>>>>>>>> [...]
>>>>>>>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
>>>>>>>>>>> prm_xen_v02_stop_0 on h05 (local)
>>>>>>>>>>> [...]
>>>>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
>>>>>>>>>>> 19552 exited with return code 0
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
>>>>>>>>>>> prm_xen_v02_start_0 on h05 (local)
>>>>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02'
>>>>>>>>>>> already exists with ID '3'
>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
>>>>>>>>>>> "/etc/xen/vm/v02".
>>>>>>>>>>> [...]
>>>>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
>>>>>>>>>>> 19686 exited with return code 1
>>>>>>>>>>> [...]
>>>>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
>>>>>>>>>>> (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
>>>>>>>>>>> failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>> As you can clearly see, "start" failed because the guest was found up already!
>>>>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>>>>
>>>>>>>>>>> I guess the following test is problematic:
>>>>>>>>>>> ---
>>>>>>>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>>>>> rc=$?
>>>>>>>>>>> if [ $rc -ne 0 ]; then
>>>>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>>>>> ---
>>>>>>>>>>> Here "xm create" probably fails if the guest is already created...
>>>>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dejan
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ulrich
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> schrieb am 01.10.2013 um 12:24 in
>>>>>>>>>>> Nachricht <20131001102430.GA4687@walrus.homenet>:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug) as I am
>>>>>>>>>>>>>> sure I am not the only one with this issue.
>>>>>>>>>>>>>> For now I have set all my
>>>>>>>>>>>>>> VMs to destroy so that the cluster is the only thing managing them, but
>>>>>>>>>>>>>> this is not super clean as I get failures in my logs that are not really
>>>>>>>>>>>>>> failures.
>>>>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>>>>>> Take a look here:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dejan
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Lars
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Architect Storage/HA
>>>>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
>>>>>>>>>>>>> HRB 21284 (AG Nürnberg)
>>>>>>>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
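For reference, the short-circuit behaviour Tom walks through in the quoted thread above can be reproduced outside the cluster with a small stand-alone sketch (ocf_is_probe is stubbed here; this is not the RA code itself):

    #!/bin/sh
    # Stand-alone sketch of the guard discussed above; ocf_is_probe is
    # stubbed so the script runs outside the cluster.
    __OCF_ACTION=monitor
    ocf_is_probe() { return 1; }   # pretend this is a regular monitor, not a probe

    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        echo "guard taken: return early, the retry loop is skipped"
    else
        echo "guard not taken: the retry loop would run"
    fi
    # prints: guard taken: return early, the retry loop is skipped

This matches Tom's truth table: false || true is true, so a non-probe monitor returns before the retry loop, which is why the guard was subsequently moved into the monitor function.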
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems