Re: vms stopped while restarted by user

2014-07-18 Thread Tomasz Zięba
Hello,

After analyzing the code, I am convinced that the 10 minutes is indeed
tied to the parameters Daan pointed out, specifically in:

class: HighAvailabilityManagerImpl.java

function: protected Long stopVM(final HaWorkVO work) throws
ConcurrentOperationException
...
work.setTimesTried(work.getTimesTried() + 1);
if (s_logger.isDebugEnabled()) {
    s_logger.debug("Stop was unsuccessful.  Rescheduling");
}
return (System.currentTimeMillis() >> 10) + _stopRetryInterval;
...

where:
value = params.get("stop.retry.interval");
_stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);
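
To make the arithmetic concrete, here is a minimal sketch of the expression above. It assumes the returned value is later compared against the current time shifted right by 10 bits in the same way, which this thread does not confirm. The class and method names are hypothetical. Under that assumption one unit is 1024 ms rather than one second, so the default of 600 units works out to about 614 seconds, slightly more than 10 minutes.

```java
public class StopRetrySketch {
    // Default from the snippet above: 10 * 60, with no "* 1000" applied.
    static final long STOP_RETRY_INTERVAL = 10 * 60;

    // Mirrors the expression in stopVM: current time in 1024 ms ticks plus the interval.
    static long nextRetryTick(long nowMillis) {
        return (nowMillis >> 10) + STOP_RETRY_INTERVAL;
    }

    // Converts a tick count back to milliseconds (one tick = 2^10 ms).
    static long ticksToMillis(long ticks) {
        return ticks << 10;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Difference between the scheduled tick and the current tick, in milliseconds.
        long delayMillis = ticksToMillis(nextRetryTick(now)) - ticksToMillis(now >> 10);
        System.out.println("Effective delay: " + delayMillis / 1000.0 + " s");
    }
}
```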



However, I believe the problem lies elsewhere, not in a wrong value for
these parameters.
Whenever a machine has a problem with rescheduling, the HA worker's activity
is preceded by a message such as:

2014-07-17 12:43:27,800 DEBUG [vm.dao.VMInstanceDaoImpl]
(HA-Worker-0:work-2824) Unable to update VM[User|centos-min-10G-test]: DB
Data={Host=null; State=Stopped; updated=5; time=Thu Jul 17 12:43:27 CEST
2014} New Data: {Host=null; State=Stopped; updated=5; time=Thu Jul 17
12:43:27 CEST 2014} Stale Data: {Host=58; State=Stopping; updated=4;
time=Thu Jul 17 12:43:15 CEST 2014}

This log entry corresponds to:

class: VMInstanceDaoImpl.java
function: public boolean updateState(State oldState, Event event, State
newState, VirtualMachine vm, Object opaque)
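
One possible reading of the log entry, offered only as an assumption, is that updateState applies the transition conditionally on the state and update counter the caller last read, and reports failure when another writer got there first. The sketch below illustrates that pattern with a hypothetical in-memory row. It is not the actual VMInstanceDaoImpl code, and the names are invented for illustration.

```java
public class StaleUpdateSketch {
    // Hypothetical stand-in for a VM row; not the real CloudStack schema.
    static final class Row {
        String state;
        long updated;

        Row(String state, long updated) {
            this.state = state;
            this.updated = updated;
        }
    }

    // Applies the transition only if the row still matches what the caller last read,
    // analogous to UPDATE ... WHERE state = ? AND updated = ?
    static synchronized boolean updateState(Row row, String expectedState,
                                            long expectedUpdated, String newState) {
        if (!row.state.equals(expectedState) || row.updated != expectedUpdated) {
            return false; // another writer changed the row first
        }
        row.state = newState;
        row.updated = expectedUpdated + 1;
        return true;
    }

    public static void main(String[] args) {
        Row row = new Row("Stopping", 4);
        // First writer moves Stopping -> Stopped and bumps the counter to 5.
        System.out.println(updateState(row, "Stopping", 4, "Stopped"));
        // A second caller still holding the old snapshot (Stopping, updated=4) is rejected.
        System.out.println(updateState(row, "Stopping", 4, "Stopped"));
    }
}
```

In the log above, the "Stale Data" (State=Stopping; updated=4) would play the role of the old snapshot and the "DB Data" (State=Stopped; updated=5) the row as already changed by someone else.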

However, at the moment we cannot determine the cause.

Maybe someone has an idea?




2014-07-15 21:12 GMT+02:00 Chiradeep Vittal chiradeep.vit...@citrix.com:

  Agree. Not sure why your system is so slow, but these parameters should
 help

   From: Daan Hoogland daan.hoogl...@gmail.com
 Reply-To: dev@cloudstack.apache.org dev@cloudstack.apache.org
 Date: Tuesday, July 15, 2014 at 6:29 AM
 To: Tomasz Zięba t.a.zi...@gmail.com
 Cc: dev@cloudstack.apache.org dev@cloudstack.apache.org, Marcus
 Sorensen shadow...@gmail.com, Damoder Reddy damoder.re...@citrix.com
 Subject: vms stopped while restarted by user

   Tomasz,

  I can only fantasize on the full rationale of the implementation of
 the retry but in general it makes sense to me. A job has a time to try
 and a times tried field. the worker manager has time to sleep and max
 retries. As you can see below these are read from the configuration:

  value = params.get("time.to.sleep");
 _timeToSleep = NumbersUtil.parseInt(value, 60) * 1000;

  value = params.get("max.retries");
 _maxRetries = NumbersUtil.parseInt(value, 5);

  there is also

  value = params.get("stop.retry.interval");
 _stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);


  The time.to.sleep and stop.retry.interval seem to jointly explain the
 ten minute scenario you described in the bug report. They don't explain it
 completely, as some of the handling of the values is based on
 bit shifting and not on datetime calculus (using mixed factors of
 1000, 60, 60, 24 and 365.25).
 You can try and play with those to tune your setup. In any case,
 looking at the VM to decide whether to restart it is not useful, as
 CloudStack will do some cleanup after stopping the instance. You
 should really wait until CloudStack reports on the job with either
 success or error.

  On Tue, Jul 15, 2014 at 3:12 PM, Tomasz Zięba t.a.zi...@gmail.com
 wrote:

 Hello,

  The user does not receive confirmation of the operation.
 From the point of view of user input it looks like the machine itself
 stopped.

  As you can see in the logs, ACS explicitly sends the stop command, as if
 the user had pressed the Stop button in the GUI, so the ACS / MS is aware
 of the action.

  I can not point out which component may be responsible for it.
 We have tried to analyze the code to understand what is happening,
 but the part of the code related to HAWorker is not very clear.
 Unfortunately we could not find online any assumptions on the level of
 architecture / design of HAWorker.

  Maybe a method of small steps will help find a solution.
 First, a small question: why does HAWorker perform a reschedule? What was
 the idea behind this behavior?




  2014-07-15 14:26 GMT+02:00 Daan Hoogland daan.hoogl...@gmail.com:

  Tomasz,

  As I understand the issue this is what happens:

  The user stops the VM from the UI
 The MS sends the stop command to the machine
 The machine takes a long time to stop
 The MS reschedules the stop
 Then the machine stops
 The user starts the machine
 The MS gets around to the rescheduled stop and stops the machine

  Did the user ever get a confirmation that the machine was stopped or
 that stopping failed? If so, this is the bug, as it seems the MS works
 as designed.

  Don't get me wrong; I am trying to figure out a path to a solution for
 you. I am not convinced there is a bug in the management server
 though. That doesn't mean it can't be in CloudStack overall, either at
 a design level or, for instance, in some inter-process communication.

  kind regards,
 Daan Hoogland


  On Fri, Jul 11, 2014 at 2:45 PM, Tomasz Zięba t.a.zi...@gmail.com

Re: vms stopped while restarted by user

2014-07-15 Thread Chiradeep Vittal
Agree. Not sure why your system is so slow, but these parameters should help

From: Daan Hoogland daan.hoogl...@gmail.com
Reply-To: dev@cloudstack.apache.org dev@cloudstack.apache.org
Date: Tuesday, July 15, 2014 at 6:29 AM
To: Tomasz Zięba t.a.zi...@gmail.com
Cc: dev@cloudstack.apache.org dev@cloudstack.apache.org, Marcus
Sorensen shadow...@gmail.com, Damoder Reddy damoder.re...@citrix.com
Subject: vms stopped while restarted by user

Tomasz,

I can only fantasize on the full rationale of the implementation of
the retry but in general it makes sense to me. A job has a time to try
and a times tried field. the worker manager has time to sleep and max
retries. As you can see below these are read from the configuration:

value = params.get("time.to.sleep");
_timeToSleep = NumbersUtil.parseInt(value, 60) * 1000;

value = params.get("max.retries");
_maxRetries = NumbersUtil.parseInt(value, 5);

there is also

value = params.get("stop.retry.interval");
_stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);


The time.to.sleep and stop.retry.interval seem to jointly explain the
ten minute scenario you described in the bug report. They don't explain it
completely, as some of the handling of the values is based on
bit shifting and not on datetime calculus (using mixed factors of
1000, 60, 60, 24 and 365.25).
You can try and play with those to tune your setup. In any case,
looking at the VM to decide whether to restart it is not useful, as
CloudStack will do some cleanup after stopping the instance. You
should really wait until CloudStack reports on the job with either
success or error.
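
As a rough illustration of how these three settings resolve when nothing is configured, the sketch below reads them from a plain map with the defaults shown in the snippets. The parseInt helper is a hypothetical stand-in for NumbersUtil.parseInt, assumed here to fall back to the default when the value is missing or not a number. The class and method names are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class HaConfigSketch {
    // Hypothetical stand-in for NumbersUtil.parseInt: use the default
    // when the value is absent or cannot be parsed as an integer.
    static int parseInt(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // time.to.sleep is given in seconds and converted to milliseconds.
    static long timeToSleepMillis(Map<String, String> params) {
        return parseInt(params.get("time.to.sleep"), 60) * 1000L;
    }

    static int maxRetries(Map<String, String> params) {
        return parseInt(params.get("max.retries"), 5);
    }

    // stop.retry.interval is kept as-is: no "* 1000" is applied to it.
    static int stopRetryInterval(Map<String, String> params) {
        return parseInt(params.get("stop.retry.interval"), 10 * 60);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("stop.retry.interval", "120"); // example override
        System.out.println(timeToSleepMillis(params));
        System.out.println(maxRetries(params));
        System.out.println(stopRetryInterval(params));
    }
}
```

Note that only time.to.sleep is scaled to milliseconds, which is one place where the mixed units mentioned above show up.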

On Tue, Jul 15, 2014 at 3:12 PM, Tomasz Zięba t.a.zi...@gmail.com wrote:
Hello,

The user does not receive confirmation of the operation.
From the point of view of user input it looks like the machine itself
stopped.

As you can see in the logs, ACS explicitly sends the stop command, as if
the user had pressed the Stop button in the GUI, so the ACS / MS is aware
of the action.

I can not point out which component may be responsible for it.
We have tried to analyze the code to understand what is happening,
but the part of the code related to HAWorker is not very clear.
Unfortunately we could not find online any assumptions on the level of
architecture / design of HAWorker.

Maybe a method of small steps will help find a solution.
First, a small question: why does HAWorker perform a reschedule? What was
the idea behind this behavior?




2014-07-15 14:26 GMT+02:00 Daan Hoogland daan.hoogl...@gmail.com:

Tomasz,

As I understand the issue this is what happens:

The user stops the VM from the UI
The MS sends the stop command to the machine
The machine takes a long time to stop
The MS reschedules the stop
Then the machine stops
The user starts the machine
The MS gets around to the rescheduled stop and stops the machine

Did the user ever get a confirmation that the machine was stopped or
that stopping failed? If so, this is the bug, as it seems the MS works
as designed.

Don't get me wrong; I am trying to figure out a path to a solution for
you. I am not convinced there is a bug in the management server
though. That doesn't mean it can't be in CloudStack overall, either at
a design level or, for instance, in some inter-process communication.

kind regards,
Daan Hoogland


On Fri, Jul 11, 2014 at 2:45 PM, Tomasz Zięba t.a.zi...@gmail.com wrote:
 Hello,

 We are eagerly awaiting the patch.

 The error that causes machines to shut down on their own has very serious
 consequences, both on the technical side (users need to wait 10 minutes
 and check that the machine has not been stopped automatically) and on the
 business side (the problem does not look very professional from the
 user's side).

 Given that:
 - the error was detected in February, 5 months ago,
 - the error does not exist in earlier versions (3.0.2),
 - there is a procedure to reproduce the error,

 we would be very grateful if this issue were resolved in ACS 4.4.


 --
 Regards,
 Tomasz Zięba
 Twitter: @TZieba
 LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/



--
Daan