Hello,

After analyzing the code, I am convinced that the 10 minutes are indeed tied to the parameters Daan pointed out, specifically in:

class: HighAvailabilityManagerImpl.java
function: protected Long stopVM(final HaWorkVO work) throws ConcurrentOperationException

    ......
    work.setTimesTried(work.getTimesTried() + 1);
    if (s_logger.isDebugEnabled()) {
        s_logger.debug("Stop was unsuccessful. Rescheduling");
    }
    return (System.currentTimeMillis() >> 10) + _stopRetryInterval;
    .....

where:

    value = params.get("stop.retry.interval");
    _stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);

However, I believe the problem lies somewhere other than in wrong values for these parameters.

In every case where a machine has a problem with rescheduling, the HA worker's processing is preceded by a message such as:

2014-07-17 12:43:27,800 DEBUG [vm.dao.VMInstanceDaoImpl] (HA-Worker-0:work-2824) Unable to update VM[User|centos-min-10G-test]: DB Data={Host=null; State=Stopped; updated=5; time=Thu Jul 17 12:43:27 CEST 2014} New Data: {Host=null; State=Stopped; updated=5; time=Thu Jul 17 12:43:27 CEST 2014} Stale Data: {Host=58; State=Stopping; updated=4; time=Thu Jul 17 12:43:15 CEST 2014}

This log entry corresponds to:

class: VMInstanceDaoImpl.java
function: public boolean updateState(State oldState, Event event, State newState, VirtualMachine vm, Object opaque)

However, at the moment we cannot diagnose the cause.
Does anyone have an idea?


2014-07-15 21:12 GMT+02:00 Chiradeep Vittal <chiradeep.vit...@citrix.com>:

> Agree. Not sure why your system is so slow, but these parameters should
> help
>
> From: Daan Hoogland <daan.hoogl...@gmail.com>
> Reply-To: "dev@cloudstack.apache.org" <dev@cloudstack.apache.org>
> Date: Tuesday, July 15, 2014 at 6:29 AM
> To: Tomasz Zięba <t.a.zi...@gmail.com>
> Cc: "dev@cloudstack.apache.org" <dev@cloudstack.apache.org>, Marcus
> Sorensen <shadow...@gmail.com>, Damoder Reddy <damoder.re...@citrix.com>
> Subject: vms stopped while restarted by user
>
> Tomasz,
>
> I can only speculate about the full rationale behind the implementation
> of the retry, but in general it makes sense to me.
> A job has a time to try and a times tried field. The worker manager
> has a time to sleep and max retries. As you can see below, these are
> read from the configuration:
>
>     value = params.get("time.to.sleep");
>     _timeToSleep = NumbersUtil.parseInt(value, 60) * 1000;
>
>     value = params.get("max.retries");
>     _maxRetries = NumbersUtil.parseInt(value, 5);
>
> there is also
>
>     value = params.get("stop.retry.interval");
>     _stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);
>
>
> The time.to.sleep and stop.retry.interval seem to jointly explain the
> ten minute scenario you described in the bug report. They don't explain
> it completely, as some of the handling of the values is based on
> bit shifting and not on datetime calculus (using mixed factors of
> 1000, 60, 60, 24 and 365.25).
> You can try and play with those to tune your setting. In any case,
> looking at the VM to decide whether to restart it is not useful, as
> CloudStack will do some cleanup after stopping the instance. You
> should really wait until CloudStack reports on the job with either
> success or error.
>
> On Tue, Jul 15, 2014 at 3:12 PM, Tomasz Zięba <t.a.zi...@gmail.com>
> wrote:
>
> Hello,
>
> The user does not receive confirmation of the operation.
> From the user's point of view it looks as if the machine stopped by
> itself.
>
> As you can see in the logs, ACS explicitly sends a stop command, as if
> the Stop button had been pressed in the GUI, so ACS / MS is aware of
> the action.
>
> I cannot point out which component may be responsible for it.
> We have tried to analyze the code to understand what is happening,
> but the part of the code related to HAWorker is not very clear.
> Unfortunately we could not find any description online of the
> architecture / design assumptions behind HAWorker.
>
> Maybe taking small steps will help find a solution.
> First, a small question: why does HAWorker reschedule? What was the
> idea behind this behavior?
>
>
> 2014-07-15 14:26 GMT+02:00 Daan Hoogland <daan.hoogl...@gmail.com>:
>
> Tomasz,
>
> As I understand the issue, this is what happens:
>
> - The user stops the VM from the UI
> - The MS sends the stop command to the machine
> - The machine stops, but takes a long time to do so
> - The MS reschedules the stop
> - Then the machine stops
> - The user starts the machine
> - The MS gets around to stopping the machine
>
> Did the user ever get a confirmation that the machine was stopped or
> that stopping failed? If so, this is the bug, as it seems the MS works
> as designed.
>
> Don't get me wrong; I am trying to figure out a path to a solution for
> you. I am not convinced there is a bug in the management server
> though. That doesn't mean there can't be one in CloudStack overall,
> either at a design level or, for instance, in some inter-process
> communication.
>
> kind regards,
> Daan Hoogland
>
>
> On Fri, Jul 11, 2014 at 2:45 PM, Tomasz Zięba <t.a.zi...@gmail.com>
> wrote:
> > Hello,
> >
> > We are eagerly awaiting the patch.
> >
> > The error that makes machines shut down by themselves causes very
> > serious complications, both on the technical side (users need to
> > wait for 10 minutes and check that the machine has not been stopped
> > automatically) and on the business side (this problem does not look
> > very professional from the user's side).
> >
> > Given that:
> > - the error was detected in February, so 5 months ago,
> > - in earlier versions (3.0.2) the error does not exist,
> > - there is a procedure to reproduce this error,
> >
> > we would be very grateful if this issue could be resolved in ACS 4.4.
> >
> >
> > --
> > Regards,
> > Tomasz Zięba
> > Twitter: @TZieba
> > LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/
> <http://pl.linkedin.com/pub/tomasz-zi%C4%99ba-ph-d/3b/7a8/ab6/>
>
>
>
> --
> Daan
>
>

--
Regards,
Tomasz Zięba
Twitter: @TZieba
LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/
<http://pl.linkedin.com/pub/tomasz-zi%C4%99ba-ph-d/3b/7a8/ab6/>
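
P.S. To make the unit handling in the stopVM return value concrete, here is a minimal sketch of the arithmetic. It only mirrors the expression (System.currentTimeMillis() >> 10) + _stopRetryInterval quoted at the top of this message. The class and method names (StopRetrySchedule, nextAttemptTicks, ticksToMillis) are hypothetical and are not part of CloudStack, and the sketch assumes that the returned value is later compared against the same shifted clock, which I have not verified in the code.

```java
// Hypothetical helper, not CloudStack code: it only reproduces the
// arithmetic of the return statement in stopVM().
final class StopRetrySchedule {

    // Default taken from the thread: NumbersUtil.parseInt(value, 10 * 60)
    static final long DEFAULT_STOP_RETRY_INTERVAL = 10 * 60;

    // Mirrors (System.currentTimeMillis() >> 10) + _stopRetryInterval.
    // Shifting right by 10 divides by 1024, so the result is counted in
    // ticks of 1024 ms rather than in exact seconds.
    static long nextAttemptTicks(long nowMillis, long retryInterval) {
        return (nowMillis >> 10) + retryInterval;
    }

    // Converts a number of 1024 ms ticks back to milliseconds.
    static long ticksToMillis(long ticks) {
        return ticks << 10;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long next = nextAttemptTicks(now, DEFAULT_STOP_RETRY_INTERVAL);

        // Assumption: the scheduler compares against the same shifted clock.
        long delayMillis = ticksToMillis(next - (now >> 10));

        // prints "retry delay in ms: 614400"
        System.out.println("retry delay in ms: " + delayMillis);
    }
}
```

Under that assumption the default of 10 * 60 corresponds to 614400 ms, roughly 10 minutes and 14 seconds, which is close to the 10 minutes we observe but not exactly equal to it.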