Hi. I think there is a problem with the updated Xen RA, specifically with the if statement below, though I am not sure; I may simply be confused about how bash || works. I never see my servers enter the retry loop when a VM disappears.
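To check my reading of the short-circuit, here is a minimal standalone sketch (the ocf_is_probe stub below is purely for illustration; the real helper comes from ocf-shellfuncs, where a probe is a monitor with interval 0). Run for a regular recurring monitor, it always takes the early-return branch:

#!/bin/sh
# Illustration-only stand-in for ocf_is_probe from ocf-shellfuncs:
# a probe is a monitor operation with interval 0.
ocf_is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] && [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}

# Simulate a regular (non-probe) recurring monitor operation.
__OCF_ACTION="monitor"
OCF_RESKEY_CRM_meta_interval="30000"

if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return taken; the retry loop is never reached"
else
    echo "would fall through to the retry loop"
fi

As far as I can tell, only a stop operation can make both sides of the || false, so only stop ever reaches the retry loop.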
The check in the RA reads:

if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    return $rc
fi

Does this not mean that if we run a monitor operation that is not a probe we will have:

    ocf_is_probe                  -> false
    [ "$__OCF_ACTION" != "stop" ] -> true   (monitor != stop)
    false || true                 -> true

which will cause the if statement to return $rc and never enter the loop?

Xen_Status_with_Retry() {
    local rc cnt=5

    Xen_Status $1
    rc=$?
    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        return $rc
    fi
    while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
        case "$__OCF_ACTION" in
        stop)
            ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
            ;;
        monitor)
            ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
            ;;
        *) : not reachable ;;
        esac
        sleep 1
        Xen_Status $1
        rc=$?
        let cnt=$((cnt-1))
    done
    return $rc
}

On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
> Hi Tom,
>
> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>> Hi Dejan
>>
>> Just a quick question. I cannot see your new log messages being logged
>> to syslog:
>>
>> ocf_log warn "domain $1 reported as not running, but it is expected to
>> be running! Retrying for $cnt seconds ..."
>>
>> Do you know where I can set my logging to see warn-level messages? I
>> expected to see them in my testing by default, but that does not seem to
>> be true.
> You should see them by default. But note that these warnings may
> not happen, depending on the circumstances on your host. In my
> experiments they were logged only while the guest was rebooting,
> and then just once or maybe twice. If you have recent
> resource-agents and crmsh, you can enable operation tracing (with
> crm resource trace <rsc> monitor <interval>).
>
> Thanks,
>
> Dejan
>
>> Thanks
>>
>> Tom
>>
>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>> Hi!
>>>>
>>>> I thought I'd never be bitten by this bug, but I actually was! Now I'm
>>>> wondering whether the Xen RA sees the guest if you use pygrub, and
>>>> pygrub is still counting down for the actual boot...
>>>>
>>>> But the reason why I'm writing is that I think I've discovered another
>>>> bug in the RA:
>>>>
>>>> CRM decided to "recover" the guest VM "v02":
>>>> [...]
>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>> [...]
>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>> [...]
>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>> [...]
>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>> [...]
>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>> [...]
>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>> [...]
>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>> [...]
>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>> [...]
>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>> [...]
>>>>
>>>> As you can clearly see, "start" failed because the guest was found up already!
>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>> domain being gone for a while and then reappearing.
>>>
>>>> I guess the following test is problematic:
>>>> ---
>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>> rc=$?
>>>> if [ $rc -ne 0 ]; then
>>>>     return $OCF_ERR_GENERIC
>>>> ---
>>>> Here "xm create" probably fails if the guest is already created...
>>> It should fail too. Note that this is a race, but the race is
>>> anyway caused by the strange behaviour of xen. With the recent
>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at 12:24 in message <20131001102430.GA4687@walrus.homenet>:
>>>>> Hi,
>>>>>
>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>
>>>>>>> Thanks for paying attention to this issue (not really a bug), as I am
>>>>>>> sure I am not the only one with it. For now I have set all my
>>>>>>> VMs to destroy so that the cluster is the only thing managing them,
>>>>>>> but this is not super clean, as I get failures in my logs that are
>>>>>>> not really failures.
>>>>>> It is very much a severe bug.
>>>>>>
>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>> Take a look here:
>>>>>
>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>
>>>>>> Regards,
>>>>>> Lars
>>>>>>
>>>>>> --
>>>>>> Architect Storage/HA
>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>> "Experience is the name everyone gives to their mistakes."
>>>>>> -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems