Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Tutkowski, Mike
Hi Ron,

We (mainly Giles and Will, as far as I am aware) are still in the process of 
finalizing how many rooms we get and for how long, so – unfortunately – we 
can’t answer your questions at this time.

We’re making progress on that front, though.

Thanks,
Mike

On 4/5/18, 10:28 PM, "Ron Wheeler"  wrote:


By the time you go through one and write up a commentary, you have used 
quite a bit of your discretionary time.
How many days are in the review period?

How many reviewers have volunteered?

I would hope that key organizers of the conference are only reviewing 
finalists where the author has already done a revision to address the 
reviewers comments and the reviewers have given it a passing grade.

How many presentations are going to be given?
Are there any "reserved" slots for presentations that will be given on 
behalf of the PMC as official project reports such as a roadmap or 
project overview?

Ron

On 05/04/2018 9:21 PM, Will Stevens wrote:
> I need to get through a couple reviews to figure out the commitment. I 
> have been a bit slammed at the moment.
>
> On Thu, Apr 5, 2018, 9:19 PM Tutkowski, Mike, 
> > wrote:
>
> Will – What do you think? With only 26 presentations, do you think
> it would be reasonable to just ask each reviewer to review each
> one? One time that I was on one of these panels a couple years
> ago, we each reviewed the roughly dozen presentations that were
> submitted. Of course, people may not be able to spend that amount
> of time on this.
>
> > On Apr 5, 2018, at 7:14 PM, Ron Wheeler
>  > wrote:
> >
> > We still need to manage the review process and make sure that it
> is adequately staffed.
> >
> > The allocation of presentations to reviewers has to be managed
> to be sure that the reviewers have the support that they need to
> do a proper review and that the reviews get done.
> >
> > Ron
> >
> >
> >> On 05/04/2018 11:45 AM, Tutkowski, Mike wrote:
> >> Perfect…then, unless anyone has other opinions they’d like to
> share on the topic, let’s follow that approach.
> >>
> >> On 4/5/18, 9:43 AM, "Rafael Weingärtner"
> >
> wrote:
> >>
> >> That is exactly it.
> >>  On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike
> >
> >> wrote:
> >>  > Hi Rafael,
> >> >
> >> > I think as long as we (the CloudStack Community) have the
> final say on how
> >> > we fill our allotted slots in the CloudStack track of
> ApacheCon in
> >> > Montreal, then it’s perfectly fine for us to leverage
> Apache’s normal
> >> > review process to gather all the feedback from the larger
> Apache Community.
> >> >
> >> > As you say, we could wait for the feedback to come in via
> that mechanism
> >> > and then, as per Will’s earlier comments, we could
> advertise on our users@
> >> > and dev@ mailing lists when we plan to get together for a
> call and make
> >> > final decisions on the CFP.
> >> >
> >> > Is that, in fact, what you were thinking, Rafael?
> >> >
> >> > Talk to you soon,
> >> > Mike
> >> >
> >> > On 4/4/18, 2:58 PM, "Rafael Weingärtner"
> >
> >> > wrote:
> >> >
> >> > I think everybody that “raised their hands here”
> already signed up to
> >> > review.
> >> >
> >> > Mike, what about if we only gathered the reviews from
> Apache main
> >> > review
> >> > system, and then we use that to decide which
> presentations will get in
> >> > CloudStack tracks? Then, we reduce the work on our
> side (we also remove
> >> > bias…). I do believe that the review from other peers
> from Apache
> >> > community
> >> > (even the one outside from our small community) will
> be fair and
> >> > technical
> >> > (meaning, without passion and or favoritism).
> >> >
> >> > Having said that, I think we only need a small group
> of PMCs 

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Ron Wheeler


By the time you go through one and write up a commentary, you have used 
quite a bit of your discretionary time.

How many days are in the review period?

How many reviewers have volunteered?

I would hope that key organizers of the conference are only reviewing 
finalists where the author has already done a revision to address the 
reviewers' comments and the reviewers have given it a passing grade.


How many presentations are going to be given?
Are there any "reserved" slots for presentations that will be given on 
behalf of the PMC as official project reports such as a roadmap or 
project overview?


Ron

On 05/04/2018 9:21 PM, Will Stevens wrote:
I need to get through a couple reviews to figure out the commitment. I 
have been a bit slammed at the moment.


On Thu, Apr 5, 2018, 9:19 PM Tutkowski, Mike, 
> wrote:


Will – What do you think? With only 26 presentations, do you think
it would be reasonable to just ask each reviewer to review each
one? One time that I was on one of these panels a couple years
ago, we each reviewed the roughly dozen presentations that were
submitted. Of course, people may not be able to spend that amount
of time on this.

> On Apr 5, 2018, at 7:14 PM, Ron Wheeler
> wrote:
>
> We still need to manage the review process and make sure that it
is adequately staffed.
>
> The allocation of presentations to reviewers has to be managed
to be sure that the reviewers have the support that they need to
do a proper review and that the reviews get done.
>
> Ron
>
>
>> On 05/04/2018 11:45 AM, Tutkowski, Mike wrote:
>> Perfect…then, unless anyone has other opinions they’d like to
share on the topic, let’s follow that approach.
>>
>> On 4/5/18, 9:43 AM, "Rafael Weingärtner"
>
wrote:
>>
>>     That is exactly it.
>>          On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike
>
>>     wrote:
>>          > Hi Rafael,
>>     >
>>     > I think as long as we (the CloudStack Community) have the
final say on how
>>     > we fill our allotted slots in the CloudStack track of
ApacheCon in
>>     > Montreal, then it’s perfectly fine for us to leverage
Apache’s normal
>>     > review process to gather all the feedback from the larger
Apache Community.
>>     >
>>     > As you say, we could wait for the feedback to come in via
that mechanism
>>     > and then, as per Will’s earlier comments, we could
advertise on our users@
>>     > and dev@ mailing lists when we plan to get together for a
call and make
>>     > final decisions on the CFP.
>>     >
>>     > Is that, in fact, what you were thinking, Rafael?
>>     >
>>     > Talk to you soon,
>>     > Mike
>>     >
>>     > On 4/4/18, 2:58 PM, "Rafael Weingärtner"
>
>>     > wrote:
>>     >
>>     >     I think everybody that “raised their hands here”
already signed up to
>>     >     review.
>>     >
>>     >     Mike, what about if we only gathered the reviews from
Apache main
>>     > review
>>     >     system, and then we use that to decide which
presentations will get in
>>     >     CloudStack tracks? Then, we reduce the work on our
side (we also remove
>>     >     bias…). I do believe that the review from other peers
from Apache
>>     > community
>>     >     (even the one outside from our small community) will
be fair and
>>     > technical
>>     >     (meaning, without passion and or favoritism).
>>     >
>>     >     Having said that, I think we only need a small group
of PMCs to gather
>>     > the
>>     >     results and out of the best ranked proposals, we pick
the ones to our
>>     >     tracks.
>>     >
>>     >     What do you (Mike) and others think?
>>     >
>>     >
>>     >     On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
>>     > mike.tutkow...@netapp.com >
>>     >     wrote:
>>     >
>>     >     > Hi Ron,
>>     >     >
>>     >     > I don’t actually have insight into how many people
have currently
>>     > signed
>>     >     > up online to be CFP reviewers for ApacheCon. At
present, I’m only
>>     > aware of
>>     >     > those who have responded to this e-mail chain.
>>     >     >
>>     >     > We should be able to find out more in the coming
weeks. We’re still
>>     > quite
>>     >     > early in the process.
>>     >     >
>>    

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Will Stevens
I need to get through a couple of reviews to figure out the commitment. I am
a bit slammed at the moment.

On Thu, Apr 5, 2018, 9:19 PM Tutkowski, Mike, 
wrote:

> Will – What do you think? With only 26 presentations, do you think it
> would be reasonable to just ask each reviewer to review each one? One time
> that I was on one of these panels a couple years ago, we each reviewed the
> roughly dozen presentations that were submitted. Of course, people may not
> be able to spend that amount of time on this.
>
> > On Apr 5, 2018, at 7:14 PM, Ron Wheeler 
> wrote:
> >
> > We still need to manage the review process and make sure that it is
> adequately staffed.
> >
> > The allocation of presentations to reviewers has to be managed to be
> sure that the reviewers have the support that they need to do a proper
> review and that the reviews get done.
> >
> > Ron
> >
> >
> >> On 05/04/2018 11:45 AM, Tutkowski, Mike wrote:
> >> Perfect…then, unless anyone has other opinions they’d like to share on
> the topic, let’s follow that approach.
> >>
> >> On 4/5/18, 9:43 AM, "Rafael Weingärtner" 
> wrote:
> >>
> >> That is exactly it.
> >>  On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike <
> mike.tutkow...@netapp.com>
> >> wrote:
> >>  > Hi Rafael,
> >> >
> >> > I think as long as we (the CloudStack Community) have the final
> say on how
> >> > we fill our allotted slots in the CloudStack track of ApacheCon in
> >> > Montreal, then it’s perfectly fine for us to leverage Apache’s
> normal
> >> > review process to gather all the feedback from the larger Apache
> Community.
> >> >
> >> > As you say, we could wait for the feedback to come in via that
> mechanism
> >> > and then, as per Will’s earlier comments, we could advertise on
> our users@
> >> > and dev@ mailing lists when we plan to get together for a call
> and make
> >> > final decisions on the CFP.
> >> >
> >> > Is that, in fact, what you were thinking, Rafael?
> >> >
> >> > Talk to you soon,
> >> > Mike
> >> >
> >> > On 4/4/18, 2:58 PM, "Rafael Weingärtner" <
> rafaelweingart...@gmail.com>
> >> > wrote:
> >> >
> >> > I think everybody that “raised their hands here” already
> signed up to
> >> > review.
> >> >
> >> > Mike, what about if we only gathered the reviews from Apache
> main
> >> > review
> >> > system, and then we use that to decide which presentations
> will get in
> >> > CloudStack tracks? Then, we reduce the work on our side (we
> also remove
> >> > bias…). I do believe that the review from other peers from
> Apache
> >> > community
> >> > (even the one outside from our small community) will be fair
> and
> >> > technical
> >> > (meaning, without passion and or favoritism).
> >> >
> >> > Having said that, I think we only need a small group of PMCs
> to gather
> >> > the
> >> > results and out of the best ranked proposals, we pick the
> ones to our
> >> > tracks.
> >> >
> >> > What do you (Mike) and others think?
> >> >
> >> >
> >> > On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
> >> > mike.tutkow...@netapp.com>
> >> > wrote:
> >> >
> >> > > Hi Ron,
> >> > >
> >> > > I don’t actually have insight into how many people have
> currently
> >> > signed
> >> > > up online to be CFP reviewers for ApacheCon. At present,
> I’m only
> >> > aware of
> >> > > those who have responded to this e-mail chain.
> >> > >
> >> > > We should be able to find out more in the coming weeks.
> We’re still
> >> > quite
> >> > > early in the process.
> >> > >
> >> > > Thanks for your feedback,
> >> > > Mike
> >> > >
> >> > > On 4/1/18, 9:18 AM, "Ron Wheeler" <
> rwhee...@artifact-software.com>
> >> > wrote:
> >> > >
> >> > > How many people have signed up to be reviewers?
> >> > >
> >> > > I don't think that scheduling is part of the review
> process and
> >> > that
> >> > > can
> >> > > be done by the person/team "organizing" ApacheCon on
> behalf of
> >> > the PMC.
> >> > >
> >> > > To me review is looking at content for
> >> > > - relevance
> >> > > - quality of the presentations (suggest fixes to
> content,
> >> > English,
> >> > > graphics, etc.)
> >> > > This should result in a consensus score
> >> > > - Perfect - ready for prime time
> >> > > - Needs minor changes as documented by the reviewers
> >> > > - Great topic but needs more work - perhaps a reviewer
> could
> >> > volunteer
> >> > > to work with the presenter to 

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Tutkowski, Mike
Will – What do you think? With only 26 presentations, do you think it would be 
reasonable to just ask each reviewer to review each one? One time when I was on 
one of these panels a couple of years ago, we each reviewed the roughly dozen 
presentations that were submitted. Of course, people may not be able to spend 
that amount of time on this.

> On Apr 5, 2018, at 7:14 PM, Ron Wheeler  
> wrote:
> 
> We still need to manage the review process and make sure that it is 
> adequately staffed.
> 
> The allocation of presentations to reviewers has to be managed to be sure 
> that the reviewers have the support that they need to do a proper review and 
> that the reviews get done.
> 
> Ron
> 
> 
>> On 05/04/2018 11:45 AM, Tutkowski, Mike wrote:
>> Perfect…then, unless anyone has other opinions they’d like to share on the 
>> topic, let’s follow that approach.
>> 
>> On 4/5/18, 9:43 AM, "Rafael Weingärtner"  wrote:
>> 
>> That is exactly it.
>>  On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike 
>> 
>> wrote:
>>  > Hi Rafael,
>> >
>> > I think as long as we (the CloudStack Community) have the final say on 
>> how
>> > we fill our allotted slots in the CloudStack track of ApacheCon in
>> > Montreal, then it’s perfectly fine for us to leverage Apache’s normal
>> > review process to gather all the feedback from the larger Apache 
>> Community.
>> >
>> > As you say, we could wait for the feedback to come in via that 
>> mechanism
>> > and then, as per Will’s earlier comments, we could advertise on our 
>> users@
>> > and dev@ mailing lists when we plan to get together for a call and make
>> > final decisions on the CFP.
>> >
>> > Is that, in fact, what you were thinking, Rafael?
>> >
>> > Talk to you soon,
>> > Mike
>> >
>> > On 4/4/18, 2:58 PM, "Rafael Weingärtner" 
>> > wrote:
>> >
>> > I think everybody that “raised their hands here” already signed up 
>> to
>> > review.
>> >
>> > Mike, what about if we only gathered the reviews from Apache main
>> > review
>> > system, and then we use that to decide which presentations will 
>> get in
>> > CloudStack tracks? Then, we reduce the work on our side (we also 
>> remove
>> > bias…). I do believe that the review from other peers from Apache
>> > community
>> > (even the one outside from our small community) will be fair and
>> > technical
>> > (meaning, without passion and or favoritism).
>> >
>> > Having said that, I think we only need a small group of PMCs to 
>> gather
>> > the
>> > results and out of the best ranked proposals, we pick the ones to 
>> our
>> > tracks.
>> >
>> > What do you (Mike) and others think?
>> >
>> >
>> > On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
>> > mike.tutkow...@netapp.com>
>> > wrote:
>> >
>> > > Hi Ron,
>> > >
>> > > I don’t actually have insight into how many people have currently
>> > signed
>> > > up online to be CFP reviewers for ApacheCon. At present, I’m only
>> > aware of
>> > > those who have responded to this e-mail chain.
>> > >
>> > > We should be able to find out more in the coming weeks. We’re 
>> still
>> > quite
>> > > early in the process.
>> > >
>> > > Thanks for your feedback,
>> > > Mike
>> > >
>> > > On 4/1/18, 9:18 AM, "Ron Wheeler" 
>> 
>> > wrote:
>> > >
>> > > How many people have signed up to be reviewers?
>> > >
>> > > I don't think that scheduling is part of the review process 
>> and
>> > that
>> > > can
>> > > be done by the person/team "organizing" ApacheCon on behalf 
>> of
>> > the PMC.
>> > >
>> > > To me review is looking at content for
>> > > - relevance
>> > > - quality of the presentations (suggest fixes to content,
>> > English,
>> > > graphics, etc.)
>> > > This should result in a consensus score
>> > > - Perfect - ready for prime time
>> > > - Needs minor changes as documented by the reviewers
>> > > - Great topic but needs more work - perhaps a reviewer could
>> > volunteer
>> > > to work with the presenter to get it ready if chosen
>> > > - Not recommended for topic or content reasons
>> > >
>> > > The reviewers could also make non-binding recommendations 
>> about
>> > the
>> > > balance between topics - marketing(why Cloudstack),
>> > > Operations/implementation, Technical details, Roadmap, etc.
>> > based on
>> >

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Ron Wheeler
We still need to manage the review process and make sure that it is 
adequately staffed.


The allocation of presentations to reviewers has to be managed to be 
sure that the reviewers have the support that they need to do a proper 
review and that the reviews get done.


Ron


On 05/04/2018 11:45 AM, Tutkowski, Mike wrote:

Perfect…then, unless anyone has other opinions they’d like to share on the 
topic, let’s follow that approach.

On 4/5/18, 9:43 AM, "Rafael Weingärtner"  wrote:

 That is exactly it.
 
 On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike 

 wrote:
 
 > Hi Rafael,

 >
 > I think as long as we (the CloudStack Community) have the final say on 
how
 > we fill our allotted slots in the CloudStack track of ApacheCon in
 > Montreal, then it’s perfectly fine for us to leverage Apache’s normal
 > review process to gather all the feedback from the larger Apache 
Community.
 >
 > As you say, we could wait for the feedback to come in via that mechanism
 > and then, as per Will’s earlier comments, we could advertise on our 
users@
 > and dev@ mailing lists when we plan to get together for a call and make
 > final decisions on the CFP.
 >
 > Is that, in fact, what you were thinking, Rafael?
 >
 > Talk to you soon,
 > Mike
 >
 > On 4/4/18, 2:58 PM, "Rafael Weingärtner" 
 > wrote:
 >
 > I think everybody that “raised their hands here” already signed up to
 > review.
 >
 > Mike, what about if we only gathered the reviews from Apache main
 > review
 > system, and then we use that to decide which presentations will get 
in
 > CloudStack tracks? Then, we reduce the work on our side (we also 
remove
 > bias…). I do believe that the review from other peers from Apache
 > community
 > (even the one outside from our small community) will be fair and
 > technical
 > (meaning, without passion and or favoritism).
 >
 > Having said that, I think we only need a small group of PMCs to 
gather
 > the
 > results and out of the best ranked proposals, we pick the ones to our
 > tracks.
 >
 > What do you (Mike) and others think?
 >
 >
 > On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
 > mike.tutkow...@netapp.com>
 > wrote:
 >
 > > Hi Ron,
 > >
 > > I don’t actually have insight into how many people have currently
 > signed
 > > up online to be CFP reviewers for ApacheCon. At present, I’m only
 > aware of
 > > those who have responded to this e-mail chain.
 > >
 > > We should be able to find out more in the coming weeks. We’re still
 > quite
 > > early in the process.
 > >
 > > Thanks for your feedback,
 > > Mike
 > >
 > > On 4/1/18, 9:18 AM, "Ron Wheeler" 
 > wrote:
 > >
 > > How many people have signed up to be reviewers?
 > >
 > > I don't think that scheduling is part of the review process and
 > that
 > > can
 > > be done by the person/team "organizing" ApacheCon on behalf of
 > the PMC.
 > >
 > > To me review is looking at content for
 > > - relevance
 > > - quality of the presentations (suggest fixes to content,
 > English,
 > > graphics, etc.)
 > > This should result in a consensus score
 > > - Perfect - ready for prime time
 > > - Needs minor changes as documented by the reviewers
 > > - Great topic but needs more work - perhaps a reviewer could
 > volunteer
 > > to work with the presenter to get it ready if chosen
 > > - Not recommended for topic or content reasons
 > >
 > > The reviewers could also make non-binding recommendations about
 > the
 > > balance between topics - marketing(why Cloudstack),
 > > Operations/implementation, Technical details, Roadmap, etc.
 > based on
 > > what they have seen.
 > >
 > > This should be used by the organizers to make the choices and
 > organize
 > > the program.
 > > The organizers have the final say on the choice of 
presentations
 > and
 > > schedule
 > >
 > > Reviewers are there to help the process not control it.
 > >
 > > I would be worried that you do not have enough reviewers rather
 > than
 > > too
 > > many.
 > > Then the work falls on the PMC and organizers.
 > >
 > > When planning meetings, I would recommend that you clearly
 > separate the
 > >  

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread ilya musayev
Andrija

This is a tough scenario.

As an admin, the way I would have handled this situation is to advertise
the upcoming outage and then take away specific API commands from the user a
day before, so they cannot kick off any long-running async jobs. Once
maintenance completes, enable those API commands for the user again. However,
I don't know who your user base is or whether this would be an acceptable
solution.
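
If the dynamic-roles feature is in play, a rough sketch of that kind of
temporary restriction could look like the following (this assumes CloudStack
4.9+ dynamic roles and the third-party "cs" Python client; the role id and
the blocked command are placeholders, not a prescription):

# Sketch only: temporarily deny a long-running API command for a role before
# a maintenance window, then remove the deny rule afterwards. Assumes
# CloudStack 4.9+ dynamic roles and the third-party "cs" client (pip install cs).
from cs import CloudStack

api = CloudStack(endpoint='http://mgmt.example.com:8080/client/api',
                 key='API_KEY', secret='SECRET_KEY')   # placeholders

ROLE_ID = 'uuid-of-the-affected-role'   # placeholder
RULE = 'createSnapshot'                 # example of a long-running async command

def deny_before_maintenance():
    # Adds a deny rule for the command; rule ordering within the role
    # determines precedence, so the new rule may need to be moved to the top.
    return api.createRolePermission(roleid=ROLE_ID, rule=RULE,
                                    permission='deny',
                                    description='blocked for maintenance')

def allow_after_maintenance(permission_id):
    # permission_id comes from the createRolePermission response; removing it
    # restores the user's normal access once maintenance completes.
    return api.deleteRolePermission(id=permission_id)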

Perhaps also investigate what can be done to speed up your long-running
tasks...

As a side note, we will be working on a feature that would allow for
graceful termination of the process/job, meaning that if the agent notices a
disconnect or termination request, it will abort the command in flight. We
could also consider restarting these tasks afterwards - but that would
not be part of this enhancement.

Regards
ilya

On Thu, Apr 5, 2018 at 6:47 AM, Andrija Panic 
wrote:

> Hi Ilya,
>
> thanks for the feedback - but in "real world", you need to "understand"
> that 60min is next to useless timeout for some jobs (if I understand this
> specific parameter correctly ?? - job is really canceled, not only job
> monitoring is canceled ???) -
>
> My value for the  "job.cancel.threshold.minutes" is 2880 minutes (2 days?)
>
> I can tell you that when you have a 500GB volume on CEPH/NFS (CEPH is the even
> "worse" case, since reads are slower during the qemu-img convert process...),
> a snapshot job will take many hours. Should I mention 1TB volumes (yes, we had
> clients like that...)?
> Then attaching a 1TB volume that was uploaded to ACS (it lives originally on
> Secondary Storage, and takes time to be copied over to NFS/CEPH) will take
> up to a few hours.
> Then migrating a 1TB volume from NFS to CEPH, or CEPH to NFS, also takes
> time... etc.
>
> I'm just giving you feedback as "user", admin of the cloud, zero DEV skills
> here :) , just to make sure you make practical decisions (and I admit I
> might be wrong with my stuff, but just giving you feedback from our public
> cloud setup)
>
>
> Cheers!
>
>
>
>
> On 5 April 2018 at 15:16, Tutkowski, Mike 
> wrote:
>
> > Wow, there’s been a lot of good details noted from several people on how
> > this process works today and how we’d like it to work in the near future.
> >
> > 1) Any chance this is already documented on the Wiki?
> >
> > 2) If not, any chance someone would be willing to do so (a flow diagram
> > would be particularly useful).
> >
> > > On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier 
> > wrote:
> > >
> > > Hi all,
> > >
> > > Good point ilya but as stated by Sergey there's more thing to consider
> > > before being able to do a proper shutdown. I augmented my script I gave
> > you
> > > originally and changed code in CS. What we're doing for our environment
> > is
> > > as follow:
> > >
> > > 1. the MGMT looks for a change in the file /etc/lb-agent which contains
> > > keywords for HAproxy[2] (ready, maint) so that HA-proxy can disable the
> > > mgmt on the keyword "maint" and the mgmt server stops a couple of
> > > threads[1] to stop processing async jobs in the queue
> > > 2. Looks for the async jobs and wait until there is none to ensure you
> > can
> > > send the reconnect commands (if jobs are running, a reconnect will
> result
> > > in a failed job since the result will never reach the management
> server -
> > > the agent waits for the current job to be done before reconnecting, and
> > > discard the result... rooms for improvement here!)
> > > 3. Issue a reconnectHost command to all the hosts connected to the mgmt
> > > server so that they reconnect to another one, otherwise the mgmt must
> be
> > up
> > > since it is used to forward commands to agents.
> > > 4. when all agents are reconnected, we can shutdown the management
> server
> > > and perform the maintenance.
> > >
> > > One issue remains for me, during the reconnect, the commands that are
> > > processed at the same time should be kept in a queue until the agents
> > have
> > > finished any current jobs and have reconnected. Today the little time
> > > window during which the reconnect happens can lead to failed jobs due
> to
> > > the agent not being connected at the right moment.
> > >
> > > I could push a PR for the change to stop some processing threads based
> on
> > > the content of a file. It's possible also to cancel the drain of the
> > > management by simply changing the content of the file back to "ready"
> > > again, instead of "maint" [2].
> > >
> > > [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> > > [2] HA proxy documentation on agent checker: https://cbonte.github.io/
> > > haproxy-dconv/1.6/configuration.html#5.2-agent-check
> > >
> > > Regarding your issue on the port blocking, I think it's fair to
> consider
> > > that if you want to shutdown your server at some point, you have to
> stop
> > > serving (some) requests. Here the only way it's to stop serving
> > everything.
> > > If the API had a REST design, we 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread ilya musayev
After much useful input from many of you, I realize my approach is
somewhat incomplete and possibly very optimistic.

Speaking to Marcus, here is what we propose as an alternate solution. I was
hoping to stay outside of the "core" - but it looks like there is no other
way around it.

Proposed functionality: management server feature to prepare for maintenance
(I'm thinking this should be applicable to multi-node setups only):
* drain all connections on 8250 for KVM and other agents by issuing a
reconnect command on the agents
* while 8250 is still listening, any new attempt to connect will be blocked
and the agent will be asked to reconnect (if you have an LB, it will route it
to another node and eventually reconnect all agents to other nodes - this
might be an area where Marc's HAProxy solution would plug in). In 4.11 there
is a new framework for managing agent connectivity without needing a load
balancer; we need to investigate how this will work.
* allow the existing running async tasks to complete, as per the
"job.cancel.threshold.minutes" max value (see the sketch below)
* queue the new tasks and process them on the next management server
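
As a sketch of that drain check (nothing official - it assumes direct read
access to the cloud database and the async_job columns from the original
proposal quoted later in this thread; host, credentials and msid below are
placeholders):

# Sketch: wait until this management server has no pending async jobs, up to
# the job.cancel.threshold.minutes limit. Assumes direct read access to the
# cloud DB; connection details and MSID are placeholders.
import time
import pymysql

MSID = 345049296209          # placeholder: this management server's msid
THRESHOLD_MINUTES = 60       # mirror job.cancel.threshold.minutes (or its max)

def pending_jobs(conn):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) FROM async_job "
            "WHERE job_status = 0 "
            "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
            "AND job_init_msid = %s", (MSID,))
        return cur.fetchone()[0]

def wait_for_drain():
    conn = pymysql.connect(host='db.example.com', user='cloud',
                           password='secret', db='cloud')   # placeholders
    deadline = time.time() + THRESHOLD_MINUTES * 60
    while time.time() < deadline:
        remaining = pending_jobs(conn)
        if remaining == 0:
            print("No pending async jobs for msid %s; safe to proceed." % MSID)
            return True
        print("%d async job(s) still running; waiting..." % remaining)
        time.sleep(30)
    print("Timed out waiting for async jobs to finish.")
    return False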

Still don't know what will happen to Xen or VMware in this case - perhaps the
ShapeBlue team can help answer or fill in the blanks for us.

Regards,
ilya

On Thu, Apr 5, 2018 at 2:48 PM, ilya musayev 
wrote:

> Hi Sergey
>
> Glad to see you are doing well,
>
> I was gonna say drop "enterprise virtualization company" and save a
> $fortune$ - but its not for everyone :)
>
> I'll post another proposed solution to bottom of this thread.
>
> Regards
> ilya
>
>
> On Wed, Apr 4, 2018 at 5:22 PM, Sergey Levitskiy 
> wrote:
>
>> Now without spellchecking :)
>>
>> This is not simple e.g. for VMware. Each management server also acts as
>> an agent proxy so tasks against a particular ESX host will be always
>> forwarded. That right answer will be to support a native “maintenance mode”
>> for management server. When entered to such mode the management server
>> should release all agents including SSVM, block/redirect API calls and
>> login request and finish all async job it originated.
>>
>>
>>
>> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy  serg...@hotmail.com>> wrote:
>>
>> This is not simple e.g. for VMware. Each management server also acts as
>> an agent proxy so tasks against a particular ESX host will be always
>> forwarded. That right answer will be to a native support for “maintenance
>> mode” for management server. When entered to such mode the management
>> server should release all agents including save, block/redirect API calls
>> and login request and finish all a sync job it originated.
>>
>> Sent from my iPhone
>>
>> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
>> rafaelweingart...@gmail.com> wrote:
>>
>> Ilya, still regarding the management server that is being shut down issue;
>> if other MSs/or maybe system VMs (I am not sure to know if they are able
>> to
>> do such tasks) can direct/redirect/send new jobs to this management server
>> (the one being shut down), the process might never end because new tasks
>> are always being created for the management server that we want to shut
>> down. Is this scenario possible?
>>
>> That is why I mentioned blocking the port 8250 for the
>> “graceful-shutdown”.
>>
>> If this scenario is not possible, then everything s fine.
>>
>>
>> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <
>> ilya.mailing.li...@gmail.com>
>> wrote:
>>
>> I'm thinking of using a configuration from "job.cancel.threshold.minutes"
>> -
>> it will be the longest
>>
>> "category": "Advanced",
>>
>> "description": "Time (in minutes) for async-jobs to be forcely
>> cancelled if it has been in process for long",
>>
>> "name": "job.cancel.threshold.minutes",
>>
>> "value": "60"
>>
>>
>>
>>
>> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
>> rafaelweingart...@gmail.com> wrote:
>>
>> Big +1 for this feature; I only have a few doubts.
>>
>> * Regarding the tasks/jobs that management servers (MSs) execute; are
>> these
>> tasks originate from requests that come to the MS, or is it possible that
>> requests received by one management server to be executed by other? I
>> mean,
>> if I execute a request against MS1, will this request always be
>> executed/threated by MS1, or is it possible that this request is executed
>> by another MS (e.g. MS2)?
>>
>> * I would suggest that after we block traffic coming from
>> 8080/8443/8250(we
>> will need to block this as well right?), we can log the execution of
>> tasks.
>> I mean, something saying, there are XXX tasks (enumerate tasks) still
>> being
>> executed, we will wait for them to finish before shutting down.
>>
>> * The timeout (60 minutes suggested) could be global settings that we can
>> load before executing the graceful-shutdown.
>>
>> On Wed, Apr 4, 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread ilya musayev
Hi Sergey

Glad to see you are doing well,

I was gonna say drop the "enterprise virtualization company" and save a
$fortune$ - but it's not for everyone :)

I'll post another proposed solution to the bottom of this thread.

Regards
ilya


On Wed, Apr 4, 2018 at 5:22 PM, Sergey Levitskiy 
wrote:

> Now without spellchecking :)
>
> This is not simple e.g. for VMware. Each management server also acts as an
> agent proxy so tasks against a particular ESX host will be always
> forwarded. That right answer will be to support a native “maintenance mode”
> for management server. When entered to such mode the management server
> should release all agents including SSVM, block/redirect API calls and
> login request and finish all async job it originated.
>
>
>
> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy > wrote:
>
> This is not simple e.g. for VMware. Each management server also acts as an
> agent proxy so tasks against a particular ESX host will be always
> forwarded. That right answer will be to a native support for “maintenance
> mode” for management server. When entered to such mode the management
> server should release all agents including save, block/redirect API calls
> and login request and finish all a sync job it originated.
>
> Sent from my iPhone
>
> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> Ilya, still regarding the management server that is being shut down issue;
> if other MSs/or maybe system VMs (I am not sure to know if they are able to
> do such tasks) can direct/redirect/send new jobs to this management server
> (the one being shut down), the process might never end because new tasks
> are always being created for the management server that we want to shut
> down. Is this scenario possible?
>
> That is why I mentioned blocking the port 8250 for the “graceful-shutdown”.
>
> If this scenario is not possible, then everything s fine.
>
>
> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev  >
> wrote:
>
> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
> it will be the longest
>
> "category": "Advanced",
>
> "description": "Time (in minutes) for async-jobs to be forcely
> cancelled if it has been in process for long",
>
> "name": "job.cancel.threshold.minutes",
>
> "value": "60"
>
>
>
>
> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> Big +1 for this feature; I only have a few doubts.
>
> * Regarding the tasks/jobs that management servers (MSs) execute; are
> these
> tasks originate from requests that come to the MS, or is it possible that
> requests received by one management server to be executed by other? I
> mean,
> if I execute a request against MS1, will this request always be
> executed/threated by MS1, or is it possible that this request is executed
> by another MS (e.g. MS2)?
>
> * I would suggest that after we block traffic coming from
> 8080/8443/8250(we
> will need to block this as well right?), we can log the execution of
> tasks.
> I mean, something saying, there are XXX tasks (enumerate tasks) still
> being
> executed, we will wait for them to finish before shutting down.
>
> * The timeout (60 minutes suggested) could be global settings that we can
> load before executing the graceful-shutdown.
>
> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <
> ilya.mailing.li...@gmail.com
>
> wrote:
>
> Use case:
> In any environment - time to time - administrator needs to perform a
> maintenance. Current stop sequence of cloudstack management server will
> ignore the fact that there may be long running async jobs - and
> terminate
> the process. This in turn can create a poor user experience and
> occasional
> inconsistency  in cloudstack db.
>
> This is especially painful in large environments where the user has
> thousands of nodes and there is a continuous patching that happens
> around
> the clock - that requires migration of workload from one node to
> another.
>
> With that said - i've created a script that monitors the async job
> queue
> for given MS and waits for it complete all jobs. More details are
> posted
> below.
>
> I'd like to introduce "graceful-shutdown" into the systemctl/service of
> cloudstack-management service.
>
> The details of how it will work is below:
>
> Workflow for graceful shutdown:
> Using iptables/firewalld - block any connection attempts on 8080/8443
> (we
> can identify the ports dynamically)
> Identify the MSID for the node, using the proper msid - query
> async_job
> table for
> 1) any jobs that are still running (or job_status=“0”)
> 2) job_dispatcher not like “pseudoJobDispatcher"
> 3) job_init_msid=$my_ms_id
>
> Monitor this async_job table for 60 minutes - until all async jobs for
> MSID
> are 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread ilya musayev
Marc

Thank you for posting the details of how your implementation works.
Unfortunately for us, HAProxy is not an option, hence we can't take
advantage of this implementation - but please do share it with the community;
perhaps it will help someone else.
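
For anyone who can use HAProxy, here is a minimal sketch of the responder
side of the agent-check approach Marc describes below - a tiny TCP service
that answers HAProxy with the keyword stored in /etc/lb-agent ("ready" or
"maint"). The port and the fail-open default are assumptions, not part of
Marc's setup:

# Sketch of an HAProxy agent-check responder: HAProxy polls this port and gets
# back the keyword stored in /etc/lb-agent ("ready" or "maint").
import socketserver

STATE_FILE = '/etc/lb-agent'
LISTEN_PORT = 9777   # must match agent-port on the HAProxy server line

class AgentCheckHandler(socketserver.BaseRequestHandler):
    def handle(self):
        try:
            with open(STATE_FILE) as f:
                state = f.read().strip() or 'ready'
        except IOError:
            state = 'ready'   # fail open if the file is missing
        # HAProxy expects a short newline-terminated reply such as "ready" or "maint".
        self.request.sendall((state + '\n').encode('ascii'))

if __name__ == '__main__':
    with socketserver.TCPServer(('0.0.0.0', LISTEN_PORT), AgentCheckHandler) as srv:
        srv.serve_forever()

On the HAProxy side this would pair with "agent-check agent-port 9777" on the
management-server backend lines (see the documentation link Marc quotes below).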

I'm going to post a new proposed solution to the bottom of this thread.

Regards
ilya

On Thu, Apr 5, 2018 at 2:36 AM, Marc-Aurèle Brothier 
wrote:

> Hi all,
>
> Good point ilya but as stated by Sergey there's more thing to consider
> before being able to do a proper shutdown. I augmented my script I gave you
> originally and changed code in CS. What we're doing for our environment is
> as follow:
>
> 1. the MGMT looks for a change in the file /etc/lb-agent which contains
> keywords for HAproxy[2] (ready, maint) so that HA-proxy can disable the
> mgmt on the keyword "maint" and the mgmt server stops a couple of
> threads[1] to stop processing async jobs in the queue
> 2. Looks for the async jobs and wait until there is none to ensure you can
> send the reconnect commands (if jobs are running, a reconnect will result
> in a failed job since the result will never reach the management server -
> the agent waits for the current job to be done before reconnecting, and
> discard the result... rooms for improvement here!)
> 3. Issue a reconnectHost command to all the hosts connected to the mgmt
> server so that they reconnect to another one, otherwise the mgmt must be up
> since it is used to forward commands to agents.
> 4. when all agents are reconnected, we can shutdown the management server
> and perform the maintenance.
>
> One issue remains for me, during the reconnect, the commands that are
> processed at the same time should be kept in a queue until the agents have
> finished any current jobs and have reconnected. Today the little time
> window during which the reconnect happens can lead to failed jobs due to
> the agent not being connected at the right moment.
>
> I could push a PR for the change to stop some processing threads based on
> the content of a file. It's possible also to cancel the drain of the
> management by simply changing the content of the file back to "ready"
> again, instead of "maint" [2].
>
> [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> [2] HA proxy documentation on agent checker: https://cbonte.github.io/
> haproxy-dconv/1.6/configuration.html#5.2-agent-check
>
> Regarding your issue on the port blocking, I think it's fair to consider
> that if you want to shutdown your server at some point, you have to stop
> serving (some) requests. Here the only way it's to stop serving everything.
> If the API had a REST design, we could reject any POST/PUT/DELETE
> operations and allow GET ones. I don't know how hard it would be today to
> only allow listBaseCmd operations to be more friendly with the users.
>
> Marco
>
>
> On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy 
> wrote:
>
> > Now without spellchecking :)
> >
> > This is not simple e.g. for VMware. Each management server also acts as
> an
> > agent proxy so tasks against a particular ESX host will be always
> > forwarded. That right answer will be to support a native “maintenance
> mode”
> > for management server. When entered to such mode the management server
> > should release all agents including SSVM, block/redirect API calls and
> > login request and finish all async job it originated.
> >
> >
> >
> > On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy   > serg...@hotmail.com>> wrote:
> >
> > This is not simple e.g. for VMware. Each management server also acts as
> an
> > agent proxy so tasks against a particular ESX host will be always
> > forwarded. That right answer will be to a native support for “maintenance
> > mode” for management server. When entered to such mode the management
> > server should release all agents including save, block/redirect API calls
> > and login request and finish all a sync job it originated.
> >
> > Sent from my iPhone
> >
> > On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
> > rafaelweingart...@gmail.com> wrote:
> >
> > Ilya, still regarding the management server that is being shut down
> issue;
> > if other MSs/or maybe system VMs (I am not sure to know if they are able
> to
> > do such tasks) can direct/redirect/send new jobs to this management
> server
> > (the one being shut down), the process might never end because new tasks
> > are always being created for the management server that we want to shut
> > down. Is this scenario possible?
> >
> > That is why I mentioned blocking the port 8250 for the
> “graceful-shutdown”.
> >
> > If this scenario is not possible, then everything s fine.
> >
> >
> > On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <
> ilya.mailing.li...@gmail.com
> > >
> > wrote:
> >
> > I'm thinking of using a configuration from
> "job.cancel.threshold.minutes" -
> > it will be the 

Re: System VM Template

2018-04-05 Thread Rafael Weingärtner
I am using this template for system VMs:
http://download.cloudstack.org/systemvm/4.11/systemvmtemplate-4.11.0-xen.vhd.bz2
And, right now, the ACS version I am using was built using the branch of
this PR: https://github.com/apache/cloudstack/pull/2524. Everything seems
to be just fine here.

Could you get some details regarding the VR status that ACS is seeing?
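
If it helps, one quick way to dump the state ACS reports for the routers (a
sketch assuming the third-party "cs" Python client; endpoint and keys are
placeholders - cloudmonkey's "list routers" shows the same fields):

# Sketch: list the virtual routers as ACS sees them, including the state and
# the "requiresupgrade" flag behind the GUI's "Requires Upgrade" column.
# Assumes the third-party "cs" client (pip install cs); credentials are placeholders.
from cs import CloudStack

api = CloudStack(endpoint='http://mgmt.example.com:8080/client/api',
                 key='API_KEY', secret='SECRET_KEY')

for r in api.listRouters(listall=True).get('router', []):
    print(r.get('name'), r.get('state'),
          'requiresupgrade=%s' % r.get('requiresupgrade'),
          'version=%s' % r.get('version'))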

On Thu, Apr 5, 2018 at 1:56 PM, Tutkowski, Mike 
wrote:

> Thanks for your feedback, Rafael.
>
> I re-created my 4.12 cloud today (after fetching the latest code and using
> the master branch) and still seem to be having trouble with the VR. The
> hypervisor type I’m using here is XenServer 6.5.
>
> When I examine the VR in the CloudStack GUI, the “Requires Upgrade” column
> says, “Yes”. However, when I try to initiate the upgrade, I get an error
> message stating that the VR is not in the proper state (because it’s stuck
> in the Starting state).
>
> The system VM template I am working with is the following:
> http://cloudstack.apt-get.eu/systemvm/4.11/
>
> In case anyone sees something, I’ve included the contents of my VR’s
> cloud.log file below.
>
> Thanks!
>
> Thu Apr  5 16:45:01 UTC 2018 Executing cloud-early-config
> Thu Apr  5 16:45:01 UTC 2018 Detected that we are running inside xen-domU
> Thu Apr  5 16:45:02 UTC 2018 Scripts checksum detected: oldmd5=
> 60703a62ef9d1666975ec0a8ce421270 newmd5=7f8c303cd3303ff902e7ad9f3f1f092b
> Thu Apr  5 16:45:02 UTC 2018 Patched scripts using
> /media/cdrom/cloud-scripts.tgz
> Thu Apr  5 16:45:02 UTC 2018 Patching cloud service
> Thu Apr  5 16:45:02 UTC 2018 Configuring systemvm type=dhcpsrvr
> Thu Apr  5 16:45:02 UTC 2018 Setting up dhcp server system vm
> Thu Apr  5 16:45:04 UTC 2018 Setting up dnsmasq
> Thu Apr  5 16:45:05 UTC 2018 Setting up apache web server
> Thu Apr  5 16:45:05 UTC 2018 Processors = 1  Enable service  = 0
> Thu Apr  5 16:45:05 UTC 2018 cloud: enable_fwding = 0
> Thu Apr  5 16:45:05 UTC 2018 enable_fwding = 0
> Thu Apr  5 16:45:05 UTC 2018 Finished setting up systemvm
> 2018-04-05 16:45:05,924  merge.py load:296 Continuing with the processing
> of file '/var/cache/cloud/cmd_line.json'
> 2018-04-05 16:45:05,927  merge.py process:101 Command of type cmdline
> received
> 2018-04-05 16:45:05,928  merge.py process:101 Command of type ips received
> 2018-04-05 16:45:05,929  merge.py process:101 Command of type ips received
> 2018-04-05 16:45:05,930  CsHelper.py execute:188 Executing: ip addr show
> dev eth1
> 2018-04-05 16:45:05,941  CsHelper.py execute:188 Executing: ip addr show
> dev eth0
> 2018-04-05 16:45:05,950  CsHelper.py execute:188 Executing: ip addr show
> dev eth1
> 2018-04-05 16:45:05,958  CsAddress.py process:108 Address found in DataBag
> ==> {u'public_ip': u'169.254.3.171', u'one_to_one_nat': False,
> u'nic_dev_id': u'1', u'network': u'169.254.0.0/16', u'netmask':
> u'255.255.0.0', u'source_nat': False, u'broadcast': u'169.254.255.255',
> u'add': True, u'nw_type': u'control', u'device': u'eth1', u'cidr': u'
> 169.254.3.171/16', u'gateway': u'None', u'size': u'16'}
> 2018-04-05 16:45:05,959  CsAddress.py process:116 Address 169.254.3.171/16
> on device eth1 already configured
> 2018-04-05 16:45:05,959  CsRoute.py defaultroute_exists:103 Checking if
> default ipv4 route is present
> 2018-04-05 16:45:05,959  CsHelper.py execute:188 Executing: ip -4 route
> list 0/0
> 2018-04-05 16:45:05,967  CsRoute.py defaultroute_exists:107 Default route
> found: default via 10.117.40.126 dev eth0
> 2018-04-05 16:45:05,967  CsHelper.py execute:188 Executing: ip addr show
> dev eth0
> 2018-04-05 16:45:05,976  CsAddress.py process:108 Address found in DataBag
> ==> {u'public_ip': u'10.117.40.33', u'one_to_one_nat': False,
> u'nic_dev_id': u'0', u'network': u'10.117.40.0/25', u'netmask':
> u'255.255.255.128', u'source_nat': False, u'broadcast': u'10.117.40.127',
> u'add': True, u'nw_type': u'guest', u'device': u'eth0', u'cidr': u'
> 10.117.40.33/25', u'gateway': u'None', u'size': u'25'}
> 2018-04-05 16:45:05,976  CsAddress.py process:116 Address 10.117.40.33/25
> on device eth0 already configured
> 2018-04-05 16:45:05,976  CsRoute.py add_table:37 Adding route table: 0
> Table_eth0 to /etc/iproute2/rt_tables if not present
> 2018-04-05 16:45:05,978  CsHelper.py execute:188 Executing: sudo echo 0
> Table_eth0 >> /etc/iproute2/rt_tables
> 2018-04-05 16:45:06,015  CsHelper.py execute:188 Executing: ip rule show
> 2018-04-05 16:45:06,026  CsHelper.py execute:188 Executing: ip rule show
> 2018-04-05 16:45:06,034  CsHelper.py execute:188 Executing: ip rule add
> fwmark 0 table Table_eth0
> 2018-04-05 16:45:06,042  CsRule.py addMark:49 Added fwmark rule for
> Table_eth0
> 2018-04-05 16:45:06,043  CsHelper.py execute:188 Executing: ip link show
> eth0 | grep 'state DOWN'
> 2018-04-05 16:45:06,053  CsHelper.py execute:193 Command 'ip link show
> eth0 | grep 'state DOWN'' returned non-zero exit status 1
> 2018-04-05 16:45:06,053  CsHelper.py execute:188 Executing: arping -c 1 

Re: System VM Template

2018-04-05 Thread Tutkowski, Mike
OK, wait a second. :)

It works now. It just took a longer time than normal.

When I examine the VR in the GUI, it no longer says it requires an upgrade and 
has transitioned to the Running state.

It usually only takes a minute or so for it to come up and get into the Running 
state. It took about 10 minutes in this case, but it did end up working.

On 4/5/18, 10:56 AM, "Tutkowski, Mike"  wrote:

Thanks for your feedback, Rafael.

I re-created my 4.12 cloud today (after fetching the latest code and using 
the master branch) and still seem to be having trouble with the VR. The 
hypervisor type I’m using here is XenServer 6.5.

When I examine the VR in the CloudStack GUI, the “Requires Upgrade” column 
says, “Yes”. However, when I try to initiate the upgrade, I get an error 
message stating that the VR is not in the proper state (because it’s stuck in 
the Starting state).

The system VM template I am working with is the following: 
http://cloudstack.apt-get.eu/systemvm/4.11/

In case anyone sees something, I’ve included the contents of my VR’s 
cloud.log file below.

Thanks!

Thu Apr  5 16:45:01 UTC 2018 Executing cloud-early-config
Thu Apr  5 16:45:01 UTC 2018 Detected that we are running inside xen-domU
Thu Apr  5 16:45:02 UTC 2018 Scripts checksum detected: 
oldmd5=60703a62ef9d1666975ec0a8ce421270 newmd5=7f8c303cd3303ff902e7ad9f3f1f092b
Thu Apr  5 16:45:02 UTC 2018 Patched scripts using 
/media/cdrom/cloud-scripts.tgz
Thu Apr  5 16:45:02 UTC 2018 Patching cloud service
Thu Apr  5 16:45:02 UTC 2018 Configuring systemvm type=dhcpsrvr
Thu Apr  5 16:45:02 UTC 2018 Setting up dhcp server system vm
Thu Apr  5 16:45:04 UTC 2018 Setting up dnsmasq
Thu Apr  5 16:45:05 UTC 2018 Setting up apache web server
Thu Apr  5 16:45:05 UTC 2018 Processors = 1  Enable service  = 0
Thu Apr  5 16:45:05 UTC 2018 cloud: enable_fwding = 0
Thu Apr  5 16:45:05 UTC 2018 enable_fwding = 0
Thu Apr  5 16:45:05 UTC 2018 Finished setting up systemvm
2018-04-05 16:45:05,924  merge.py load:296 Continuing with the processing 
of file '/var/cache/cloud/cmd_line.json'
2018-04-05 16:45:05,927  merge.py process:101 Command of type cmdline 
received
2018-04-05 16:45:05,928  merge.py process:101 Command of type ips received
2018-04-05 16:45:05,929  merge.py process:101 Command of type ips received
2018-04-05 16:45:05,930  CsHelper.py execute:188 Executing: ip addr show 
dev eth1
2018-04-05 16:45:05,941  CsHelper.py execute:188 Executing: ip addr show 
dev eth0
2018-04-05 16:45:05,950  CsHelper.py execute:188 Executing: ip addr show 
dev eth1
2018-04-05 16:45:05,958  CsAddress.py process:108 Address found in DataBag 
==> {u'public_ip': u'169.254.3.171', u'one_to_one_nat': False, u'nic_dev_id': 
u'1', u'network': u'169.254.0.0/16', u'netmask': u'255.255.0.0', u'source_nat': 
False, u'broadcast': u'169.254.255.255', u'add': True, u'nw_type': u'control', 
u'device': u'eth1', u'cidr': u'169.254.3.171/16', u'gateway': u'None', u'size': 
u'16'}
2018-04-05 16:45:05,959  CsAddress.py process:116 Address 169.254.3.171/16 
on device eth1 already configured
2018-04-05 16:45:05,959  CsRoute.py defaultroute_exists:103 Checking if 
default ipv4 route is present
2018-04-05 16:45:05,959  CsHelper.py execute:188 Executing: ip -4 route 
list 0/0
2018-04-05 16:45:05,967  CsRoute.py defaultroute_exists:107 Default route 
found: default via 10.117.40.126 dev eth0 
2018-04-05 16:45:05,967  CsHelper.py execute:188 Executing: ip addr show 
dev eth0
2018-04-05 16:45:05,976  CsAddress.py process:108 Address found in DataBag 
==> {u'public_ip': u'10.117.40.33', u'one_to_one_nat': False, u'nic_dev_id': 
u'0', u'network': u'10.117.40.0/25', u'netmask': u'255.255.255.128', 
u'source_nat': False, u'broadcast': u'10.117.40.127', u'add': True, u'nw_type': 
u'guest', u'device': u'eth0', u'cidr': u'10.117.40.33/25', u'gateway': u'None', 
u'size': u'25'}
2018-04-05 16:45:05,976  CsAddress.py process:116 Address 10.117.40.33/25 
on device eth0 already configured
2018-04-05 16:45:05,976  CsRoute.py add_table:37 Adding route table: 0 
Table_eth0 to /etc/iproute2/rt_tables if not present 
2018-04-05 16:45:05,978  CsHelper.py execute:188 Executing: sudo echo 0 
Table_eth0 >> /etc/iproute2/rt_tables
2018-04-05 16:45:06,015  CsHelper.py execute:188 Executing: ip rule show
2018-04-05 16:45:06,026  CsHelper.py execute:188 Executing: ip rule show
2018-04-05 16:45:06,034  CsHelper.py execute:188 Executing: ip rule add 
fwmark 0 table Table_eth0
2018-04-05 16:45:06,042  CsRule.py addMark:49 Added fwmark rule for 
Table_eth0
2018-04-05 16:45:06,043  CsHelper.py execute:188 Executing: ip link show 
eth0 | grep 'state DOWN'
2018-04-05 16:45:06,053  CsHelper.py execute:193 Command 'ip link show eth0 
| grep 'state DOWN'' returned non-zero exit status 1
2018-04-05 16:45:06,053  

Re: System VM Template

2018-04-05 Thread Tutkowski, Mike
Thanks for your feedback, Rafael.

I re-created my 4.12 cloud today (after fetching the latest code and using the 
master branch) and still seem to be having trouble with the VR. The hypervisor 
type I’m using here is XenServer 6.5.

When I examine the VR in the CloudStack GUI, the “Requires Upgrade” column 
says, “Yes”. However, when I try to initiate the upgrade, I get an error 
message stating that the VR is not in the proper state (because it’s stuck in 
the Starting state).

The system VM template I am working with is the following: 
http://cloudstack.apt-get.eu/systemvm/4.11/

In case anyone sees something, I’ve included the contents of my VR’s cloud.log 
file below.

Thanks!

Thu Apr  5 16:45:01 UTC 2018 Executing cloud-early-config
Thu Apr  5 16:45:01 UTC 2018 Detected that we are running inside xen-domU
Thu Apr  5 16:45:02 UTC 2018 Scripts checksum detected: 
oldmd5=60703a62ef9d1666975ec0a8ce421270 newmd5=7f8c303cd3303ff902e7ad9f3f1f092b
Thu Apr  5 16:45:02 UTC 2018 Patched scripts using 
/media/cdrom/cloud-scripts.tgz
Thu Apr  5 16:45:02 UTC 2018 Patching cloud service
Thu Apr  5 16:45:02 UTC 2018 Configuring systemvm type=dhcpsrvr
Thu Apr  5 16:45:02 UTC 2018 Setting up dhcp server system vm
Thu Apr  5 16:45:04 UTC 2018 Setting up dnsmasq
Thu Apr  5 16:45:05 UTC 2018 Setting up apache web server
Thu Apr  5 16:45:05 UTC 2018 Processors = 1  Enable service  = 0
Thu Apr  5 16:45:05 UTC 2018 cloud: enable_fwding = 0
Thu Apr  5 16:45:05 UTC 2018 enable_fwding = 0
Thu Apr  5 16:45:05 UTC 2018 Finished setting up systemvm
2018-04-05 16:45:05,924  merge.py load:296 Continuing with the processing of 
file '/var/cache/cloud/cmd_line.json'
2018-04-05 16:45:05,927  merge.py process:101 Command of type cmdline received
2018-04-05 16:45:05,928  merge.py process:101 Command of type ips received
2018-04-05 16:45:05,929  merge.py process:101 Command of type ips received
2018-04-05 16:45:05,930  CsHelper.py execute:188 Executing: ip addr show dev 
eth1
2018-04-05 16:45:05,941  CsHelper.py execute:188 Executing: ip addr show dev 
eth0
2018-04-05 16:45:05,950  CsHelper.py execute:188 Executing: ip addr show dev 
eth1
2018-04-05 16:45:05,958  CsAddress.py process:108 Address found in DataBag ==> 
{u'public_ip': u'169.254.3.171', u'one_to_one_nat': False, u'nic_dev_id': u'1', 
u'network': u'169.254.0.0/16', u'netmask': u'255.255.0.0', u'source_nat': 
False, u'broadcast': u'169.254.255.255', u'add': True, u'nw_type': u'control', 
u'device': u'eth1', u'cidr': u'169.254.3.171/16', u'gateway': u'None', u'size': 
u'16'}
2018-04-05 16:45:05,959  CsAddress.py process:116 Address 169.254.3.171/16 on 
device eth1 already configured
2018-04-05 16:45:05,959  CsRoute.py defaultroute_exists:103 Checking if default 
ipv4 route is present
2018-04-05 16:45:05,959  CsHelper.py execute:188 Executing: ip -4 route list 0/0
2018-04-05 16:45:05,967  CsRoute.py defaultroute_exists:107 Default route 
found: default via 10.117.40.126 dev eth0 
2018-04-05 16:45:05,967  CsHelper.py execute:188 Executing: ip addr show dev 
eth0
2018-04-05 16:45:05,976  CsAddress.py process:108 Address found in DataBag ==> 
{u'public_ip': u'10.117.40.33', u'one_to_one_nat': False, u'nic_dev_id': u'0', 
u'network': u'10.117.40.0/25', u'netmask': u'255.255.255.128', u'source_nat': 
False, u'broadcast': u'10.117.40.127', u'add': True, u'nw_type': u'guest', 
u'device': u'eth0', u'cidr': u'10.117.40.33/25', u'gateway': u'None', u'size': 
u'25'}
2018-04-05 16:45:05,976  CsAddress.py process:116 Address 10.117.40.33/25 on 
device eth0 already configured
2018-04-05 16:45:05,976  CsRoute.py add_table:37 Adding route table: 0 
Table_eth0 to /etc/iproute2/rt_tables if not present 
2018-04-05 16:45:05,978  CsHelper.py execute:188 Executing: sudo echo 0 
Table_eth0 >> /etc/iproute2/rt_tables
2018-04-05 16:45:06,015  CsHelper.py execute:188 Executing: ip rule show
2018-04-05 16:45:06,026  CsHelper.py execute:188 Executing: ip rule show
2018-04-05 16:45:06,034  CsHelper.py execute:188 Executing: ip rule add fwmark 
0 table Table_eth0
2018-04-05 16:45:06,042  CsRule.py addMark:49 Added fwmark rule for Table_eth0
2018-04-05 16:45:06,043  CsHelper.py execute:188 Executing: ip link show eth0 | 
grep 'state DOWN'
2018-04-05 16:45:06,053  CsHelper.py execute:193 Command 'ip link show eth0 | 
grep 'state DOWN'' returned non-zero exit status 1
2018-04-05 16:45:06,053  CsHelper.py execute:188 Executing: arping -c 1 -I eth0 
-A -U -s 10.117.40.33 None
2018-04-05 16:45:06,066  CsHelper.py execute:193 Command 'arping -c 1 -I eth0 
-A -U -s 10.117.40.33 None' returned non-zero exit status 2
2018-04-05 16:45:06,067  CsRoute.py add_network_route:64 Adding route: dev eth0 
table: Table_eth0 network: 10.117.40.0/25 if not present
2018-04-05 16:45:06,067  CsHelper.py execute:188 Executing: ip route show dev 
eth0 table Table_eth0 throw 10.117.40.0/25 proto static
2018-04-05 16:45:06,075  CsHelper.py execute:193 Command 'ip route show dev 
eth0 table Table_eth0 throw 10.117.40.0/25 proto 

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Tutkowski, Mike
Perfect…then, unless anyone has other opinions they’d like to share on the 
topic, let’s follow that approach.

On 4/5/18, 9:43 AM, "Rafael Weingärtner"  wrote:

That is exactly it.

On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike 
wrote:

> Hi Rafael,
>
> I think as long as we (the CloudStack Community) have the final say on how
> we fill our allotted slots in the CloudStack track of ApacheCon in
> Montreal, then it’s perfectly fine for us to leverage Apache’s normal
> review process to gather all the feedback from the larger Apache 
Community.
>
> As you say, we could wait for the feedback to come in via that mechanism
> and then, as per Will’s earlier comments, we could advertise on our users@
> and dev@ mailing lists when we plan to get together for a call and make
> final decisions on the CFP.
>
> Is that, in fact, what you were thinking, Rafael?
>
> Talk to you soon,
> Mike
>
> On 4/4/18, 2:58 PM, "Rafael Weingärtner" 
> wrote:
>
> I think everybody that “raised their hands here” already signed up to
> review.
>
> Mike, what about if we only gathered the reviews from Apache main
> review
> system, and then we use that to decide which presentations will get in
> CloudStack tracks? Then, we reduce the work on our side (we also 
remove
> bias…). I do believe that the review from other peers from Apache
> community
> (even the one outside from our small community) will be fair and
> technical
> (meaning, without passion and or favoritism).
>
> Having said that, I think we only need a small group of PMCs to gather
> the
> results and out of the best ranked proposals, we pick the ones to our
> tracks.
>
> What do you (Mike) and others think?
>
>
> On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
> mike.tutkow...@netapp.com>
> wrote:
>
> > Hi Ron,
> >
> > I don’t actually have insight into how many people have currently
> signed
> > up online to be CFP reviewers for ApacheCon. At present, I’m only
> aware of
> > those who have responded to this e-mail chain.
> >
> > We should be able to find out more in the coming weeks. We’re still
> quite
> > early in the process.
> >
> > Thanks for your feedback,
> > Mike
> >
> > On 4/1/18, 9:18 AM, "Ron Wheeler" 
> wrote:
> >
> > How many people have signed up to be reviewers?
> >
> > I don't think that scheduling is part of the review process and
> that
> > can
> > be done by the person/team "organizing" ApacheCon on behalf of
> the PMC.
> >
> > To me review is looking at content for
> > - relevance
> > - quality of the presentations (suggest fixes to content,
> English,
> > graphics, etc.)
> > This should result in a consensus score
> > - Perfect - ready for prime time
> > - Needs minor changes as documented by the reviewers
> > - Great topic but needs more work - perhaps a reviewer could
> volunteer
> > to work with the presenter to get it ready if chosen
> > - Not recommended for topic or content reasons
> >
> > The reviewers could also make non-binding recommendations about
> the
> > balance between topics - marketing(why Cloudstack),
> > Operations/implementation, Technical details, Roadmap, etc.
> based on
> > what they have seen.
> >
> > This should be used by the organizers to make the choices and
> organize
> > the program.
> > The organizers have the final say on the choice of presentations
> and
> > schedule
> >
> > Reviewers are there to help the process not control it.
> >
> > I would be worried that you do not have enough reviewers rather
> than
> > too
> > many.
> > Then the work falls on the PMC and organizers.
> >
> > When planning meetings, I would recommend that you clearly
> separate the
> > roles and only invite the reviewers to the meetings about
> review. Get
> > the list of presentation to present to the reviewers and decide
> if
> > there
> > are any instructions that you want to give to reviewers.
> > I would recommend that you keep the organizing group small.
> Membership
> > should be set by the PMC and should be people that are committed
> to the
  

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Rafael Weingärtner
That is exactly it.

On Thu, Apr 5, 2018 at 12:37 PM, Tutkowski, Mike 
wrote:

> Hi Rafael,
>
> I think as long as we (the CloudStack Community) have the final say on how
> we fill our allotted slots in the CloudStack track of ApacheCon in
> Montreal, then it’s perfectly fine for us to leverage Apache’s normal
> review process to gather all the feedback from the larger Apache Community.
>
> As you say, we could wait for the feedback to come in via that mechanism
> and then, as per Will’s earlier comments, we could advertise on our users@
> and dev@ mailing lists when we plan to get together for a call and make
> final decisions on the CFP.
>
> Is that, in fact, what you were thinking, Rafael?
>
> Talk to you soon,
> Mike
>
> On 4/4/18, 2:58 PM, "Rafael Weingärtner" 
> wrote:
>
> I think everybody that “raised their hands here” already signed up to
> review.
>
> Mike, what about if we only gathered the reviews from Apache main
> review
> system, and then we use that to decide which presentations will get in
> CloudStack tracks? Then, we reduce the work on our side (we also remove
> bias…). I do believe that the review from other peers from Apache
> community
> (even the one outside from our small community) will be fair and
> technical
> (meaning, without passion and or favoritism).
>
> Having said that, I think we only need a small group of PMCs to gather
> the
> results and out of the best ranked proposals, we pick the ones to our
> tracks.
>
> What do you (Mike) and others think?
>
>
> On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike <
> mike.tutkow...@netapp.com>
> wrote:
>
> > Hi Ron,
> >
> > I don’t actually have insight into how many people have currently
> signed
> > up online to be CFP reviewers for ApacheCon. At present, I’m only
> aware of
> > those who have responded to this e-mail chain.
> >
> > We should be able to find out more in the coming weeks. We’re still
> quite
> > early in the process.
> >
> > Thanks for your feedback,
> > Mike
> >
> > On 4/1/18, 9:18 AM, "Ron Wheeler" 
> wrote:
> >
> > How many people have signed up to be reviewers?
> >
> > I don't think that scheduling is part of the review process and
> that
> > can
> > be done by the person/team "organizing" ApacheCon on behalf of
> the PMC.
> >
> > To me review is looking at content for
> > - relevance
> > - quality of the presentations (suggest fixes to content,
> English,
> > graphics, etc.)
> > This should result in a consensus score
> > - Perfect - ready for prime time
> > - Needs minor changes as documented by the reviewers
> > - Great topic but needs more work - perhaps a reviewer could
> volunteer
> > to work with the presenter to get it ready if chosen
> > - Not recommended for topic or content reasons
> >
> > The reviewers could also make non-binding recommendations about
> the
> > balance between topics - marketing(why Cloudstack),
> > Operations/implementation, Technical details, Roadmap, etc.
> based on
> > what they have seen.
> >
> > This should be used by the organizers to make the choices and
> organize
> > the program.
> > The organizers have the final say on the choice of presentations
> and
> > schedule
> >
> > Reviewers are there to help the process not control it.
> >
> > I would be worried that you do not have enough reviewers rather
> than
> > too
> > many.
> > Then the work falls on the PMC and organizers.
> >
> > When planning meetings, I would recommend that you clearly
> separate the
> > roles and only invite the reviewers to the meetings about
> review. Get
> > the list of presentation to present to the reviewers and decide
> if
> > there
> > are any instructions that you want to give to reviewers.
> > I would recommend that you keep the organizing group small.
> Membership
> > should be set by the PMC and should be people that are committed
> to the
> > ApacheCon project and have the time. The committee can request
> help for
> > specific tasks from others in the community who are not on the
> > committee.
> >
> > I would also recommend that organizers do not do reviews. They
> should
> > read the finalists but if they do reviews, there may be a
> suggestion of
> > favouring presentations that they reviewed. It also ensures that
> the
> > organizers are not getting heat from rejected presenters - "it
> is the
> > reviewers fault you did not get selected".
> >
> > My advice is to get as many reviewers as you can so that no one
> is
> > 

Re: Committee to Sort through CCC Presentation Submissions

2018-04-05 Thread Tutkowski, Mike
Hi Rafael,

I think as long as we (the CloudStack Community) have the final say on how we 
fill our allotted slots in the CloudStack track of ApacheCon in Montreal, then 
it’s perfectly fine for us to leverage Apache’s normal review process to gather 
all the feedback from the larger Apache Community.

As you say, we could wait for the feedback to come in via that mechanism and 
then, as per Will’s earlier comments, we could advertise on our users@ and dev@ 
mailing lists when we plan to get together for a call and make final decisions 
on the CFP.

Is that, in fact, what you were thinking, Rafael?

Talk to you soon,
Mike

On 4/4/18, 2:58 PM, "Rafael Weingärtner"  wrote:

I think everybody that “raised their hands here” already signed up to
review.

Mike, what about if we only gathered the reviews from Apache main review
system, and then we use that to decide which presentations will get in
CloudStack tracks? Then, we reduce the work on our side (we also remove
bias…). I do believe that the review from other peers from Apache community
(even the one outside from our small community) will be fair and technical
(meaning, without passion and or favoritism).

Having said that, I think we only need a small group of PMCs to gather the
results and out of the best ranked proposals, we pick the ones to our
tracks.

What do you (Mike) and others think?


On Tue, Apr 3, 2018 at 5:07 PM, Tutkowski, Mike 
wrote:

> Hi Ron,
>
> I don’t actually have insight into how many people have currently signed
> up online to be CFP reviewers for ApacheCon. At present, I’m only aware of
> those who have responded to this e-mail chain.
>
> We should be able to find out more in the coming weeks. We’re still quite
> early in the process.
>
> Thanks for your feedback,
> Mike
>
> On 4/1/18, 9:18 AM, "Ron Wheeler"  wrote:
>
> How many people have signed up to be reviewers?
>
> I don't think that scheduling is part of the review process and that
> can
> be done by the person/team "organizing" ApacheCon on behalf of the 
PMC.
>
> To me review is looking at content for
> - relevance
> - quality of the presentations (suggest fixes to content, English,
> graphics, etc.)
> This should result in a consensus score
> - Perfect - ready for prime time
> - Needs minor changes as documented by the reviewers
> - Great topic but needs more work - perhaps a reviewer could volunteer
> to work with the presenter to get it ready if chosen
> - Not recommended for topic or content reasons
>
> The reviewers could also make non-binding recommendations about the
> balance between topics - marketing(why Cloudstack),
> Operations/implementation, Technical details, Roadmap, etc. based on
> what they have seen.
>
> This should be used by the organizers to make the choices and organize
> the program.
> The organizers have the final say on the choice of presentations and
> schedule
>
> Reviewers are there to help the process not control it.
>
> I would be worried that you do not have enough reviewers rather than
> too
> many.
> Then the work falls on the PMC and organizers.
>
> When planning meetings, I would recommend that you clearly separate 
the
> roles and only invite the reviewers to the meetings about review. Get
> the list of presentation to present to the reviewers and decide if
> there
> are any instructions that you want to give to reviewers.
> I would recommend that you keep the organizing group small. Membership
> should be set by the PMC and should be people that are committed to 
the
> ApacheCon project and have the time. The committee can request help 
for
> specific tasks from others in the community who are not on the
> committee.
>
> I would also recommend that organizers do not do reviews. They should
> read the finalists but if they do reviews, there may be a suggestion 
of
> favouring presentations that they reviewed. It also ensures that the
> organizers are not getting heat from rejected presenters - "it is the
> reviewers fault you did not get selected".
>
> My advice is to get as many reviewers as you can so that no one is
> essential and each reviewer has a limited number of presentations to
> review but each presentation gets reviewed by multiple people. Also
> bear
> in mind that not all reviewers have the same ability to review each
> presentation.
> Reviews should be anonymous and only the summary comments given to 

[NOTICE] Remove branches CS-2163, Commit-Ratio, dedicate*, and bugfix* from Apache CloudStack official repository

2018-04-05 Thread Rafael Weingärtner
Following the protocol defined in [1], this is the notice email regarding
the removal of branches from the Apache CloudStack official repository. The Jira
ticket for the branch removal is
https://issues.apache.org/jira/browse/CLOUDSTACK-10354. The branches that
will be removed are the following:

   - CS-2163
   - Commit-Ratio
   - dedicate-guest-vlan-ranges
   - dedicate-guest-vlan-ranges_2
   - dedicate_public_ip_range
   - dedicate_public_ip_range_2
   - bugfix/CID-1114591
   - bugfix/CID-1114601
   - bugfix/CID-1116300
   - bugfix/CID-1116654
   - bugfix/CID-1116850
   - bugfix/CID-116538
   - bugfix/CID-1192805
   - bugfix/CID-1192810
   - bugfix/CID-1212198
   - bugfix/CID-106
   - bugfix/CID-1230585
   - bugfix/CID-1230587
   - bugfix/CID-1230587-2ndtime
   - bugfix/CID-1232333
   - bugfix/CID-1240106
   - bugfix/CID-1241966
   - bugfix/CID-1241967
   - bugfix/CID-1249800
   - bugfix/CID-1249801
   - bugfix/CID-1249803
   - bugfix/CID-1254835
   - bugfix/CS-7580
   - bugfix/CS-7665
   - bugfix/TO-hierarchy-flatening

If you have objections, please do share your concerns before the deletion.
The removal will happen on 13/April/2018.

[1]
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Clean+up+old+and+obsolete+branches+protocol
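
(For reference, a minimal sketch of how the deletion could be carried out - my own
illustration, not part of the protocol in [1]; it assumes a clone with the official
repository configured as the "origin" remote and push rights, and it would only be
run once the objection period has passed:)

#!/usr/bin/env python
# Hedged sketch: remove the branches listed above from the remote repository.
# Assumes the official repo is the "origin" remote and the caller may push.
import subprocess

BRANCHES = [
    "CS-2163",
    "Commit-Ratio",
    "dedicate-guest-vlan-ranges",
    "bugfix/CID-1114591",
    # ... plus the remaining branches listed in the notice above
]

for branch in BRANCHES:
    # "git push origin --delete <branch>" removes only the remote ref;
    # existing local clones keep their copies until they prune.
    subprocess.check_call(["git", "push", "origin", "--delete", branch])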

-- 
Rafael Weingärtner


Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread Andrija Panic
Hi Ilya,

thanks for the feedback - but in the "real world", you need to understand
that 60 min is a next-to-useless timeout for some jobs (if I understand this
specific parameter correctly - the job really is cancelled, not just the job
monitoring?).

My value for "job.cancel.threshold.minutes" is 2880 minutes (2 days).

I can tell you that when you have CEPH/NFS (CEPH being the even "worse" case,
since reads are slower during the qemu-img convert process...) and a 500 GB
volume, a snapshot job will take many hours. Should I mention 1 TB volumes
(yes, we have had clients like that...)?
Attaching a 1 TB volume that was uploaded to ACS (it lives originally on
Secondary Storage and takes time to be copied over to NFS/CEPH) can also take
up to a few hours.
Migrating a 1 TB volume from NFS to CEPH, or from CEPH to NFS, also takes
time... etc.

I'm just giving you feedback as a "user", an admin of the cloud with zero DEV
skills here :), just to make sure you make practical decisions (and I admit I
might be wrong about my numbers, but this is feedback from our public cloud
setup).
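
(To put rough numbers on it - a back-of-envelope illustration, with assumed rather
than measured throughput: even at an effective 100 MB/s, copying 500 GB already
runs well past a 60-minute threshold.)

# Back-of-envelope: full-copy time at an assumed effective throughput,
# compared against job.cancel.threshold.minutes (default 60, ours 2880).
def copy_minutes(size_gb, throughput_mb_s):
    # GB -> MB, divide by MB/s, convert seconds to minutes
    return size_gb * 1024.0 / throughput_mb_s / 60.0

for size_gb in (500, 1024):
    for mbps in (50, 100, 200):  # assumed effective MB/s during qemu-img convert / copy
        print("%4d GB at %3d MB/s -> ~%3.0f min" % (size_gb, mbps,
                                                    copy_minutes(size_gb, mbps)))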


Cheers!




On 5 April 2018 at 15:16, Tutkowski, Mike  wrote:

> Wow, there’s been a lot of good details noted from several people on how
> this process works today and how we’d like it to work in the near future.
>
> 1) Any chance this is already documented on the Wiki?
>
> 2) If not, any chance someone would be willing to do so (a flow diagram
> would be particularly useful).
>
> > On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier 
> wrote:
> >
> > Hi all,
> >
> > Good point ilya but as stated by Sergey there's more thing to consider
> > before being able to do a proper shutdown. I augmented my script I gave
> you
> > originally and changed code in CS. What we're doing for our environment
> is
> > as follow:
> >
> > 1. the MGMT looks for a change in the file /etc/lb-agent which contains
> > keywords for HAproxy[2] (ready, maint) so that HA-proxy can disable the
> > mgmt on the keyword "maint" and the mgmt server stops a couple of
> > threads[1] to stop processing async jobs in the queue
> > 2. Looks for the async jobs and wait until there is none to ensure you
> can
> > send the reconnect commands (if jobs are running, a reconnect will result
> > in a failed job since the result will never reach the management server -
> > the agent waits for the current job to be done before reconnecting, and
> > discard the result... rooms for improvement here!)
> > 3. Issue a reconnectHost command to all the hosts connected to the mgmt
> > server so that they reconnect to another one, otherwise the mgmt must be
> up
> > since it is used to forward commands to agents.
> > 4. when all agents are reconnected, we can shutdown the management server
> > and perform the maintenance.
> >
> > One issue remains for me, during the reconnect, the commands that are
> > processed at the same time should be kept in a queue until the agents
> have
> > finished any current jobs and have reconnected. Today the little time
> > window during which the reconnect happens can lead to failed jobs due to
> > the agent not being connected at the right moment.
> >
> > I could push a PR for the change to stop some processing threads based on
> > the content of a file. It's possible also to cancel the drain of the
> > management by simply changing the content of the file back to "ready"
> > again, instead of "maint" [2].
> >
> > [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> > [2] HA proxy documentation on agent checker: https://cbonte.github.io/
> > haproxy-dconv/1.6/configuration.html#5.2-agent-check
> >
> > Regarding your issue on the port blocking, I think it's fair to consider
> > that if you want to shutdown your server at some point, you have to stop
> > serving (some) requests. Here the only way it's to stop serving
> everything.
> > If the API had a REST design, we could reject any POST/PUT/DELETE
> > operations and allow GET ones. I don't know how hard it would be today to
> > only allow listBaseCmd operations to be more friendly with the users.
> >
> > Marco
> >
> >
> > On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy 
> > wrote:
> >
> >> Now without spellchecking :)
> >>
> >> This is not simple e.g. for VMware. Each management server also acts as
> an
> >> agent proxy so tasks against a particular ESX host will be always
> >> forwarded. That right answer will be to support a native “maintenance
> mode”
> >> for management server. When entered to such mode the management server
> >> should release all agents including SSVM, block/redirect API calls and
> >> login request and finish all async job it originated.
> >>
> >>
> >>
> >> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy   >> serg...@hotmail.com>> wrote:
> >>
> >> This is not simple e.g. for VMware. Each management server also acts as
> an
> >> agent proxy so tasks against a particular ESX host will be always
> >> forwarded. That right 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread Tutkowski, Mike
Wow, there’s been a lot of good detail noted by several people on how this 
process works today and how we’d like it to work in the near future.

1) Any chance this is already documented on the Wiki?

2) If not, any chance someone would be willing to do so? (A flow diagram would 
be particularly useful.)

> On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier  wrote:
> 
> Hi all,
> 
> Good point ilya but as stated by Sergey there's more thing to consider
> before being able to do a proper shutdown. I augmented my script I gave you
> originally and changed code in CS. What we're doing for our environment is
> as follow:
> 
> 1. the MGMT looks for a change in the file /etc/lb-agent which contains
> keywords for HAproxy[2] (ready, maint) so that HA-proxy can disable the
> mgmt on the keyword "maint" and the mgmt server stops a couple of
> threads[1] to stop processing async jobs in the queue
> 2. Looks for the async jobs and wait until there is none to ensure you can
> send the reconnect commands (if jobs are running, a reconnect will result
> in a failed job since the result will never reach the management server -
> the agent waits for the current job to be done before reconnecting, and
> discard the result... rooms for improvement here!)
> 3. Issue a reconnectHost command to all the hosts connected to the mgmt
> server so that they reconnect to another one, otherwise the mgmt must be up
> since it is used to forward commands to agents.
> 4. when all agents are reconnected, we can shutdown the management server
> and perform the maintenance.
> 
> One issue remains for me, during the reconnect, the commands that are
> processed at the same time should be kept in a queue until the agents have
> finished any current jobs and have reconnected. Today the little time
> window during which the reconnect happens can lead to failed jobs due to
> the agent not being connected at the right moment.
> 
> I could push a PR for the change to stop some processing threads based on
> the content of a file. It's possible also to cancel the drain of the
> management by simply changing the content of the file back to "ready"
> again, instead of "maint" [2].
> 
> [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> [2] HA proxy documentation on agent checker: https://cbonte.github.io/
> haproxy-dconv/1.6/configuration.html#5.2-agent-check
> 
> Regarding your issue on the port blocking, I think it's fair to consider
> that if you want to shutdown your server at some point, you have to stop
> serving (some) requests. Here the only way it's to stop serving everything.
> If the API had a REST design, we could reject any POST/PUT/DELETE
> operations and allow GET ones. I don't know how hard it would be today to
> only allow listBaseCmd operations to be more friendly with the users.
> 
> Marco
> 
> 
> On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy 
> wrote:
> 
>> Now without spellchecking :)
>> 
>> This is not simple e.g. for VMware. Each management server also acts as an
>> agent proxy so tasks against a particular ESX host will be always
>> forwarded. That right answer will be to support a native “maintenance mode”
>> for management server. When entered to such mode the management server
>> should release all agents including SSVM, block/redirect API calls and
>> login request and finish all async job it originated.
>> 
>> 
>> 
>> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy  serg...@hotmail.com>> wrote:
>> 
>> This is not simple e.g. for VMware. Each management server also acts as an
>> agent proxy so tasks against a particular ESX host will be always
>> forwarded. That right answer will be to a native support for “maintenance
>> mode” for management server. When entered to such mode the management
>> server should release all agents including save, block/redirect API calls
>> and login request and finish all a sync job it originated.
>> 
>> Sent from my iPhone
>> 
>> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
>> rafaelweingart...@gmail.com> wrote:
>> 
>> Ilya, still regarding the management server that is being shut down issue;
>> if other MSs/or maybe system VMs (I am not sure to know if they are able to
>> do such tasks) can direct/redirect/send new jobs to this management server
>> (the one being shut down), the process might never end because new tasks
>> are always being created for the management server that we want to shut
>> down. Is this scenario possible?
>> 
>> That is why I mentioned blocking the port 8250 for the “graceful-shutdown”.
>> 
>> If this scenario is not possible, then everything s fine.
>> 
>> 
>> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev > >
>> wrote:
>> 
>> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
>> it will be the longest
>> 
>>"category": "Advanced",
>> 
>>"description": "Time (in 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-05 Thread Marc-Aurèle Brothier
Hi all,

Good point, ilya, but as Sergey stated, there are more things to consider
before being able to do a proper shutdown. I augmented the script I originally
gave you and changed code in CS. What we're doing for our environment is as
follows:

1. The MGMT server looks for a change in the file /etc/lb-agent, which contains
keywords for HAproxy[2] (ready, maint), so that HAproxy can disable the mgmt
server on the keyword "maint" and the mgmt server stops a couple of threads[1]
to stop processing async jobs in the queue.
2. Look for async jobs and wait until there are none, to ensure you can send
the reconnect commands (if jobs are running, a reconnect will result in a
failed job, since the result will never reach the management server - the
agent waits for the current job to finish before reconnecting and discards
the result... room for improvement here!).
3. Issue a reconnectHost command to all the hosts connected to the mgmt server
so that they reconnect to another one; otherwise the mgmt server must stay up,
since it is used to forward commands to agents.
4. When all agents have reconnected, we can shut down the management server
and perform the maintenance.

One issue remains for me: during the reconnect, commands arriving at the same
time should be kept in a queue until the agents have finished their current
jobs and reconnected. Today, the small time window during which the reconnect
happens can lead to failed jobs because the agent is not connected at the
right moment.

I could push a PR for the change that stops some processing threads based on
the content of a file. It is also possible to cancel the drain of the
management server by simply changing the content of the file back to "ready"
instead of "maint" [2].

[1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
[2] HAproxy documentation on the agent check:
https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check

Regarding your issue with the port blocking, I think it's fair to consider
that if you want to shut down your server at some point, you have to stop
serving (some) requests. Here the only way is to stop serving everything.
If the API had a REST design, we could reject any POST/PUT/DELETE operations
and allow GET ones. I don't know how hard it would be today to allow only
listBaseCmd operations, to be more friendly to the users.
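
(For illustration only - not the actual change - here is a minimal operator-side
sketch of steps 1-4 above; pending_async_jobs() and reconnect_all_hosts() are
hypothetical placeholders for the corresponding listAsyncJobs / reconnectHost
API calls:)

#!/usr/bin/env python
# Hedged sketch of the drain sequence (steps 1-4 above), not the real patch.
import time

LB_AGENT_FILE = "/etc/lb-agent"  # marker file also read by HAproxy's agent-check

def pending_async_jobs():
    """Hypothetical placeholder: would call listAsyncJobs and return how many
    jobs are still owned by this management server."""
    return 0

def reconnect_all_hosts():
    """Hypothetical placeholder: would call reconnectHost for every host
    currently attached to this management server."""
    pass

def drain():
    # Step 1: flip the marker to "maint" so HAproxy stops sending new API
    # traffic here and the mgmt server stops its async-job threads.
    with open(LB_AGENT_FILE, "w") as f:
        f.write("maint\n")
    # (writing "ready" back instead would cancel the drain)

    # Step 2: wait until no async jobs are left before moving the agents.
    while pending_async_jobs() > 0:
        time.sleep(30)

    # Step 3: ask the agents to reconnect to another management server.
    reconnect_all_hosts()

    # Step 4: this management server can now be stopped for maintenance.
    print("drained - safe to shut down this management server")

if __name__ == "__main__":
    drain()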

Marco


On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy 
wrote:

> Now without spellchecking :)
>
> This is not simple e.g. for VMware. Each management server also acts as an
> agent proxy so tasks against a particular ESX host will be always
> forwarded. That right answer will be to support a native “maintenance mode”
> for management server. When entered to such mode the management server
> should release all agents including SSVM, block/redirect API calls and
> login request and finish all async job it originated.
>
>
>
> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy > wrote:
>
> This is not simple e.g. for VMware. Each management server also acts as an
> agent proxy so tasks against a particular ESX host will be always
> forwarded. That right answer will be to a native support for “maintenance
> mode” for management server. When entered to such mode the management
> server should release all agents including save, block/redirect API calls
> and login request and finish all a sync job it originated.
>
> Sent from my iPhone
>
> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> Ilya, still regarding the management server that is being shut down issue;
> if other MSs/or maybe system VMs (I am not sure to know if they are able to
> do such tasks) can direct/redirect/send new jobs to this management server
> (the one being shut down), the process might never end because new tasks
> are always being created for the management server that we want to shut
> down. Is this scenario possible?
>
> That is why I mentioned blocking the port 8250 for the “graceful-shutdown”.
>
> If this scenario is not possible, then everything s fine.
>
>
> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev  >
> wrote:
>
> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
> it will be the longest
>
> "category": "Advanced",
>
> "description": "Time (in minutes) for async-jobs to be forcely
> cancelled if it has been in process for long",
>
> "name": "job.cancel.threshold.minutes",
>
> "value": "60"
>
>
>
>
> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> Big +1 for this feature; I only have a few doubts.
>
> * Regarding the tasks/jobs that management servers (MSs) execute; are
> these
> tasks originate from requests that come to the MS, or is it possible that
> requests received by one management server