Re: [Gluster-infra] Reboot policy for the infra

2018-08-23 Thread Michael Scherer
On Thursday 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> One more piece that's missing is when we'll restart the physical servers.
> That seems to be entirely missing. The rest looks good to me and I'm happy
> to add an item to next sprint to automate the node rebooting.

That's covered by the rule that dependencies are "as critical as the
services that depend on them".

Now, the problem I do have is that some servers (myrmicinae, to name one)
take 30 minutes to reboot, and I can't diagnose or fix that without
spending hours on it. This is the one running gerrit/jenkins, so it's not
possible to spend time on this kind of test.
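
The dependency rule can be sketched in a few lines of Python. This is
only an illustration under assumptions: the inventory, the dependency
map and the two-level scale are hypothetical, and the only facts taken
from this thread are that gerrit/jenkins run on myrmicinae and that the
hypervisors for builders are not critical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    NOT_CRITICAL = 0  # can be rebooted without an announcement
    CRITICAL = 1      # reboot must be announced in advance


# Hypothetical inventory: each item's own level, before looking at dependents.
OWN = {
    "gerrit": Criticality.CRITICAL,
    "jenkins": Criticality.CRITICAL,
    "builder": Criticality.NOT_CRITICAL,
    "myrmicinae": Criticality.NOT_CRITICAL,
    "builder-hypervisor": Criticality.NOT_CRITICAL,  # made-up host name
}

# Hypothetical dependency map: item -> the things it runs on or needs.
DEPENDS_ON = {
    "gerrit": ["myrmicinae"],
    "jenkins": ["myrmicinae"],
    "builder": ["builder-hypervisor"],
}


def effective_criticality(own, depends_on):
    """Raise every dependency to the level of its most critical dependent."""
    effective = dict(own)
    lowest = Criticality.NOT_CRITICAL
    changed = True
    while changed:  # repeat until stable so chains of dependencies are covered
        changed = False
        for item, deps in depends_on.items():
            for dep in deps:
                if effective.get(dep, lowest) < effective.get(item, lowest):
                    effective[dep] = effective.get(item, lowest)
                    changed = True
    return effective


levels = effective_criticality(OWN, DEPENDS_ON)
print(levels["myrmicinae"].name)          # CRITICAL
print(levels["builder-hypervisor"].name)  # NOT_CRITICAL
```

Levels only ever go up, so the loop stops even if the map contains a
cycle.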



> On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer wrote:
> 
> > Hi,
> > 
> > so that's kernel reboot time again, this time courtesy of Intel
> > (again). I do not consider the issue to be "OMG the sky is falling",
> > but it is enough to take the time to streamline our reboot process.
> > 
> > 
> > 
> > Currently, we do not have a policy or anything, and I think the
> > negotiation around each reboot is cumbersome:
> > - we need to reach people, which takes time and adds latency (that
> > would be bad for an urgent issue, and likely adds undue stress while
> > waiting)
> > 
> > - we need to keep track of what was supposed to be done, which is also
> > cumbersome
> > 
> > While that would not be a problem if I had only gluster to deal with,
> > my team of 3 has to deal with a few more projects than one, and
> > orchestrating choices for a dozen groups is time consuming (just think
> > of the last time you had to go to a restaurant after a conference to
> > see how hard it is to reach agreement).
> > 
> > So I would propose that we simplify that with the following policy:
> > 
> > - Jenkins builders would be rebooted by jenkins on a regular basis.
> > I do not know how we can do that, but given that we have enough nodes
> > to sustain builds, it shouldn't impact developers in a big way. The
> > only exception is the freebsd builder, since we only have 1 functional
> > at the moment. But once the 2nd is working, it should be treated like
> > the others.
> > 
> > - services in HA (firewall, reverse proxy, internal squid/DNS) would be
> > rebooted during the day without notice. Since HA is working, that has
> > no user impact. In fact, that's already what I do.
> > 
> > - services not in HA should be pushed towards HA (gerrit might get
> > there one day, no way for jenkins :/, we need to look into postgres and
> > therefore fstat/softserve, and maybe try to get something for
> > download.gluster.org)
> > 
> > - reboots of services that are critical and not in HA should be
> > announced in advance. Critical means the services listed here:
> > https://gluster-infra-docs.readthedocs.io/emergency.html
> > 
> > - services not visible to end users (backup servers, ansible deployment
> > etc) can be rebooted at will
> > 
> > Then the only question is what to do about things not in the previous
> > categories, like softserve and fstat.
> > 
> > Also, all dependencies are as critical as the most critical service
> > that depends on them. So hypervisors hosting gerrit/jenkins are critical
> > (until we find a way to avoid outages), the ones for builders are not.
> > 
> > 
> > 
> > Thoughts, ideas?
> > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
> > ___
> > Gluster-infra mailing list
> > Gluster-infra@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-infra
> 
> 
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS




Re: [Gluster-infra] Reboot policy for the infra

2018-08-22 Thread Nigel Babu
One more piece that's missing is when we'll restart the physical servers.
That seems to be entirely missing. The rest looks good to me and I'm happy
to add an item to next sprint to automate the node rebooting.
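
As a starting point for that automation, here is a minimal sketch of how
a periodic job could choose which builders to reboot. It only does the
selection: the builder names, the groups, the `max_fraction` parameter
and the idea of rebooting only idle nodes are assumptions made for
illustration, and the actual Jenkins and reboot calls are deliberately
left out.

```python
from dataclasses import dataclass


@dataclass
class Builder:
    name: str
    group: str  # e.g. "centos" or "freebsd" (hypothetical grouping)
    idle: bool  # True when no build is running on it


def pick_reboot_batch(builders, max_fraction=0.25):
    """Pick idle builders to reboot now, never emptying a group."""
    by_group = {}
    for b in builders:
        by_group.setdefault(b.group, []).append(b)

    batch = []
    for members in by_group.values():
        if len(members) < 2:
            continue  # a lone builder (the freebsd case) is left alone
        # Take a fraction of the group, at least 1, but always keep 1 online.
        quota = min(len(members) - 1, max(1, int(len(members) * max_fraction)))
        idle = [b for b in members if b.idle]
        batch.extend(idle[:quota])
    return batch


# Hypothetical fleet, for illustration only.
fleet = [
    Builder("builder1", "centos", idle=True),
    Builder("builder2", "centos", idle=False),
    Builder("builder3", "centos", idle=True),
    Builder("builder4", "centos", idle=True),
    Builder("freebsd1", "freebsd", idle=True),
]
print([b.name for b in pick_reboot_batch(fleet)])  # ['builder1']
```

Run on a schedule, this would cycle through the builders a few at a
time; a real job would also need to remember which nodes were already
rebooted so it does not keep picking the same one.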

-- 
nigelb