Re: vms stopped while restarted by user

Tomasz Zięba Fri, 18 Jul 2014 08:38:14 -0700

Hello,

After analyzing the code, I am convinced that the 10 minutes is indeed
tied to the parameters Daan pointed out, specifically in:


class: HighAvailabilityManagerImpl.java

function: protected Long stopVM(final HaWorkVO work) throws
ConcurrentOperationException
......
work.setTimesTried(work.getTimesTried() + 1);
if (s_logger.isDebugEnabled()) {
    s_logger.debug("Stop was unsuccessful.  Rescheduling");
}
return (System.currentTimeMillis() >> 10) + _stopRetryInterval;
.....

where:
value = params.get("stop.retry.interval");
_stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);
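
To make the arithmetic concrete, here is a minimal sketch of what that return statement computes. It assumes the returned value is later compared against the current time shifted the same way (`>> 10`), which I have not verified in the scheduler code. The class and method names are mine and are not part of CloudStack. The point is that `>> 10` divides by 1024, not 1000, so the default of 600 units is a little over ten minutes.

```java
public class RetryIntervalSketch {
    // Default taken from the snippet above: NumbersUtil.parseInt(value, 10 * 60)
    static final long STOP_RETRY_INTERVAL = 10 * 60;

    // Same shape as the return statement in stopVM():
    // a timestamp expressed in units of 1024 ms, plus the retry interval.
    static long nextRetryInUnits(long nowMillis, long retryInterval) {
        return (nowMillis >> 10) + retryInterval;
    }

    // Converts a span expressed in 1024 ms units back to milliseconds.
    static long unitsToMillis(long units) {
        return units << 10;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long retryAt = nextRetryInUnits(now, STOP_RETRY_INTERVAL);

        // Distance between "now" and the retry time, both in 1024 ms units.
        long units = retryAt - (now >> 10);
        long millis = unitsToMillis(units);

        System.out.println("interval in units: " + units);
        System.out.println("interval in ms:    " + millis);
        System.out.println("interval in s:     " + millis / 1000.0);
    }
}
```

Under that assumption the interval is 600 * 1024 ms, i.e. roughly 614 seconds, which is consistent with the "about 10 minutes" we observe.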



However, I believe the problem lies elsewhere, not in wrong values
of these parameters.
In every case where a machine has a problem with rescheduling, the HA worker's
activity is preceded by a message such as:

2014-07-17 12:43:27,800 DEBUG [vm.dao.VMInstanceDaoImpl]
(HA-Worker-0:work-2824) Unable to update VM[User|centos-min-10G-test]: DB
Data={Host=null; State=Stopped; updated=5; time=Thu Jul 17 12:43:27 CEST
2014} New Data: {Host=null; State=Stopped; updated=5; time=Thu Jul 17
12:43:27 CEST 2014} Stale Data: {Host=58; State=Stopping; updated=4;
time=Thu Jul 17 12:43:15 CEST 2014}

This log entry corresponds to:

class: VMInstanceDaoImpl.java
function: public boolean updateState(State oldState, Event event, State
newState, VirtualMachine vm, Object opaque)
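
The "Unable to update VM" message, with its DB Data / New Data / Stale Data triple and the `updated` counter going from 4 to 5, looks like an optimistic-locking check that failed. The following is only a sketch of that pattern, assuming updateState refuses to write when the row no longer matches what the caller last read; the `Row` class and the in-memory table are hypothetical stand-ins, not CloudStack code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class StaleUpdateSketch {
    // Hypothetical stand-in for a VM row: state, host and an update counter.
    static final class Row {
        final String state;
        final Long hostId;
        final long updated;

        Row(String state, Long hostId, long updated) {
            this.state = state;
            this.hostId = hostId;
            this.updated = updated;
        }
    }

    private final Map<Long, Row> table = new HashMap<>();

    synchronized void put(long id, Row row) { table.put(id, row); }

    synchronized Row get(long id) { return table.get(id); }

    // Writes the new state only if the stored row still matches what the
    // caller last saw; otherwise reports failure and leaves the row alone.
    synchronized boolean updateState(long id, Row seen, String newState, Long newHostId) {
        Row db = table.get(id);
        boolean unchanged = db != null
                && db.updated == seen.updated
                && db.state.equals(seen.state)
                && Objects.equals(db.hostId, seen.hostId);
        if (!unchanged) {
            return false; // the caller's view is stale
        }
        table.put(id, new Row(newState, newHostId, db.updated + 1));
        return true;
    }

    public static void main(String[] args) {
        StaleUpdateSketch dao = new StaleUpdateSketch();
        Row seen = new Row("Stopping", 58L, 4);
        dao.put(1L, seen);

        // Two actors both read the row while it was Stopping on host 58.
        boolean first = dao.updateState(1L, seen, "Stopped", null);
        boolean second = dao.updateState(1L, seen, "Stopped", null);

        System.out.println("first update applied:  " + first);
        System.out.println("second update applied: " + second);
        System.out.println("counter in table:      " + dao.get(1L).updated);
    }
}
```

If this is what happens, the log line would mean that something else had already moved the VM to Stopped before the HA worker's own update arrived, which would point at two actors handling the same stop rather than at the retry parameters.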

However, at the moment we cannot determine the cause.

Maybe someone has an idea?




2014-07-15 21:12 GMT+02:00 Chiradeep Vittal <chiradeep.vit...@citrix.com>:

>  Agree. Not sure why your system is so slow, but these parameters should
> help
>
>   From: Daan Hoogland <daan.hoogl...@gmail.com>
> Reply-To: "dev@cloudstack.apache.org" <dev@cloudstack.apache.org>
> Date: Tuesday, July 15, 2014 at 6:29 AM
> To: Tomasz Zięba <t.a.zi...@gmail.com>
> Cc: "dev@cloudstack.apache.org" <dev@cloudstack.apache.org>, Marcus
> Sorensen <shadow...@gmail.com>, Damoder Reddy <damoder.re...@citrix.com>
> Subject: vms stopped while restarted by user
>
>   Tomasz,
>
>  I can only fantasize on the full rationale of the implementation of
> the retry but in general it makes sense to me. A job has a time to try
> and a times tried field. the worker manager has time to sleep and max
> retries. As you can see below these are read from the configuration:
>
>          value = params.get("time.to.sleep");
>         _timeToSleep = NumbersUtil.parseInt(value, 60) * 1000;
>
>          value = params.get("max.retries");
>         _maxRetries = NumbersUtil.parseInt(value, 5);
>
>  there is also
>
>          value = params.get("stop.retry.interval");
>         _stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);
>
>
>  The time.to.sleep and stop.retry.interval seem to jointly explain the
> ten minute scenario you described in the bug report. They don't do so
> completely, as some of the handling of the values is based on
> bitshifting and not on datetime calculus (using mixed factors of
> 1000, 60, 60, 24 and 365.25).
> You can try and play with those to tune your setting. In any case
> looking at the vm to decide to restart the vm is not useful as
> CloudStack will do some cleanup after stopping the instance. You
> should really wait until CloudStack reports on the job with either
> success or error.
>
>  On Tue, Jul 15, 2014 at 3:12 PM, Tomasz Zięba <t.a.zi...@gmail.com>
> wrote:
>
> Hello,
>
>  The user does not receive confirmation of the operation.
> From the point of view of user input it looks like the machine itself
> stopped.
>
>  As you can see in the logs, the ACS explicitly sends stop command, as if
> they press the Stop button from the GUI, so it is aware of the action from
> the perspective of the ACS / MS.
>
>  I can not point out which component may be responsible for it.
> We have tried to analyze the code to understand what is happening,
> but the part of the code related to HAWorker is not very clear.
> Unfortunately we could not find online any assumptions on the level of
> architecture / design of HAWorker.
>
>  Maybe a small-steps approach will help find a solution.
> First a small question: why does HAWorker perform a reschedule? What was the idea
> behind this action?
>
>
>
>
>  2014-07-15 14:26 GMT+02:00 Daan Hoogland <daan.hoogl...@gmail.com>:
>
>  Tomasz,
>
>  As I understand the issue this is what happens:
>
>  The user stops the vm from the UI
> The MS sends the stop command to the machine
> The machine stops and takes a long time for it
> The MS reschedules the stop
> Then machine stops
> the user starts the machine
> the MS get by stopping the machine
>
>  Did the user ever get a confirmation that the machine was stopped or
> that stopping failed? If so, this is the bug, as it seems the MS works
> as designed.
>
>  Don't get me wrong; I am trying to figure out a path to a solution for
> you. I am not convinced there is a bug in the management server
> though. That doesn't mean there can't be one in CloudStack overall, either at
> a design level or for instance in some inter-process communication.
>
>  kind regards,
> Daan Hoogland
>
>
>  On Fri, Jul 11, 2014 at 2:45 PM, Tomasz Zięba <t.a.zi...@gmail.com>
> wrote:
> > Hello,
> >
> > We are eagerly awaiting the patch.
> >
> > The error that makes machines shut down by themselves causes very serious
> > complications, both on the technical side (users need to wait 10 minutes
> > and check that the machine has not been shut down automatically) and on the
> > business side (the problem does not look very professional from the user's
> > side).
> >
> > Given that:
> > - the error was detected in February, so 5 months ago,
> > - the error does not exist in earlier versions (3.0.2),
> > - there is a procedure to reproduce the error,
> >
> > we would be very grateful if this issue could be resolved in ACS 4.4.
> >
> >
> > --
> > Regards,
> > Tomasz Zięba
> > Twitter: @TZieba
> > LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/
> <http://pl.linkedin.com/pub/tomasz-zi%C4%99ba-ph-d/3b/7a8/ab6/>
> >
>
>
>
>  --
> Daan
>
>


-- 
Regards,
Tomasz Zięba
Twitter: @TZieba
LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/
<http://pl.linkedin.com/pub/tomasz-zi%C4%99ba-ph-d/3b/7a8/ab6/>
