Re: [Gluster-devel] Announcing Gluster for Container Storage (GCS)

2018-08-23 Thread Joe Julian
Personally, I'd like to see the glusterd service replaced by a k8s native 
controller (named "kluster").

I'm hoping to use this vacation I'm currently on to write up a design doc.
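
For illustration only, here is a minimal sketch of what such a Kubernetes-native
controller loop could look like, written with the kubernetes python client.
Nothing in it exists yet: the "kluster" name, the GlusterCluster custom resource
and its group/version/plural are hypothetical placeholders pending the design doc.

#!/usr/bin/env python
# Speculative sketch of a "kluster"-style controller: watch a hypothetical
# GlusterCluster custom resource and reconcile the cluster towards its spec.
from kubernetes import client, config, watch

GROUP, VERSION, PLURAL = "kluster.gluster.org", "v1alpha1", "glusterclusters"

def run_controller():
    config.load_kube_config()
    api = client.CustomObjectsApi()
    for event in watch.Watch().stream(
            api.list_cluster_custom_object, GROUP, VERSION, PLURAL):
        obj = event["object"]
        name = obj["metadata"]["name"]
        desired = obj.get("spec", {})
        # A real controller would create/scale brick pods, volumes, etc. here.
        print("reconcile %s %s -> %s" % (event["type"], name, desired))

if __name__ == "__main__":
    run_controller()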

On August 23, 2018 12:58:03 PM PDT, Michael Adam  wrote:
>On 2018-07-25 at 06:38 -0700, Vijay Bellur wrote:
>> Hi all,
>
>Hi Vijay,
>
>Thanks for announcing this to the public and making everyone
>more aware of Gluster's focus on container storage!
>
>I would like to add an additional perspective to this,
>giving some background about the history and origins:
>
>Integrating Gluster with Kubernetes to provide
>persistent storage for containerized applications is
>not new. We have been working on this for more than
>two years now, and it is used by many community users
>and many customers (of Red Hat) in production.
>
>The original software stack used heketi
>(https://github.com/heketi/heketi) as a high-level service
>interface for gluster to facilitate easy self-service
>provisioning of volumes in kubernetes. In a separate, much
>more narrowly scoped project, Heketi implemented some ideas
>that were originally part of the glusterd2 plans, which got
>us started with these efforts in the first place, and it also
>went beyond those original ideas. These features are now being
>merged into glusterd2, which will in the future replace heketi
>in the container storage stack.
>
>We were also working on kubernetes itself, writing the
>provisioners for various forms of gluster volumes in kubernetes
>proper (https://github.com/kubernetes/kubernetes) and also in the
>external-storage repo
>(https://github.com/kubernetes-incubator/external-storage).
>Those provisioners will eventually be replaced by the mentioned
>CSI drivers. The expertise from the original kubernetes
>development is now flowing into the CSI drivers.
>
>The gluster-containers repository was already created and
>used for this original container-storage effort.
>
>The mentioned https://github.com/gluster/gluster-kubernetes
>repository was not only the place for storing the deployment
>artefacts and tools, but it was actually intended to be the
>upstream home of the gluster-container-storage project.
>
>In this view, I see the GCS project announced here
>as GCS version 2. The first version (let me call it
>version one), even though it was never officially announced
>this widely in a formal introduction and never given a formal
>release or version number, was the software stack described
>above, homed at the gluster-kubernetes repository. If you look
>at this project (and heketi), you see that it has a nice level
>of popularity.
>
>I think we should make use of this traction instead of
>ignoring the legacy, and turn gluster-kubernetes into the
>home of GCS (v2). In my view, GCS (v2) will be about:
>
>* replacing some of the components with newer ones:
>  - glusterd2 instead of the heketi and glusterd1 combo
>  - CSI drivers (the new standard) instead of native
>kubernetes plugins
>* adding the operator feature
>  (even though we are currently also working on an operator
>  for the current stack with heketi and traditional gluster,
>  since this will become important in production before
>  this v2 is ready).
>
>These are my 2 cents on this topic.
>I hope someone finds them useful.
>
>I am very excited to (finally) see the broader gluster
>community getting more aligned behind the idea of bringing
>our great SDS system into the space of containers! :-)
>
>Cheers - Michael
>
>
>
>
>
>> We would like to let you know that some of us have started focusing on an
>> initiative called ‘Gluster for Container Storage’ (in short GCS). As of
>> now, one can already use Gluster as storage for containers by making use of
>> different projects available in github repositories associated with gluster
>> & Heketi.
>> The goal of the GCS initiative is to provide an easier integration of these
>> projects so that they can be consumed together as designed. We are
>> primarily focused on integration with Kubernetes (k8s) through this
>> initiative.
>>
>> Key projects for GCS include:
>> Glusterd2 (GD2)
>>
>> Repo: https://github.com/gluster/glusterd2
>>
>> The challenge we have with current management layer of Gluster (glusterd)
>> is that it is not designed for a service oriented architecture. Heketi
>> overcame this limitation and made Gluster consumable in k8s by providing
>> all the necessary hooks needed for supporting Persistent Volume Claims.
>>
>> Glusterd2 provides a service oriented architecture for volume & cluster
>> management. Gd2 also intends to provide many of the Heketi functionalities
>> needed by Kubernetes natively. Hence we are working on merging Heketi with
>> gd2 and you can follow more of this action in the issues associated with
>> the gd2 github repository.
>> gluster-block
>>
>> Repo: https://github.com/gluster/gluster-block
>>
>> This project inten

[Gluster-devel] Post mortem of 2018-08-23 (2 for the price of one)

2018-08-23 Thread Michael Scherer
Hi,

so we had 3 incidents in the last 24h, and while all of them are
different, they are also linked.

We faced several issues, starting with gerrit showing error 500
last night, around 23h Paris time.

That was https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , and it
resulted in a memory upgrade this morning.


Then we started to look at other issues that were uncovered while
investigating the first one, and I tried to look at the size of the mail
queue. Usually this is not a problem, but after adding swap, it
became an issue.

So I started to look for a way to blacklist mail sent to
jenk...@build.gluster.org, first by routing this mail domain to
supercolony, then by changing postfix to drop the mail.

And then we got 2 issues at once; timeline in UTC:


Timeline


13:42  misc adds an MX record for build.gluster.org in the zone. To do that,
the DNS zone had to be changed and build.gluster.org could no longer be a
CNAME (a CNAME cannot coexist with other record types such as MX).

14:56  Kaleb pings misc/nigel saying "there is a message about disk full
on that job"

15:00  misc clicks on the link to build.gluster.org and is greeted by an SSL
error about the certificate. It seems the DNS now resolves build.gluster.org
to 2 IPs instead of 1.

15:04  misc reverts the DNS, since there is no time to investigate.

15:05  misc figures out that the server has a full disk because the logs are
stored on /

15:07  misc also starts to swear in 2 languages

15:18  a new partition with more space is created on
http.int.rht.gluster.org; data is copied, httpd is restarted, and the
situation is back to normal

Impact:
- some build logs were lost (likely not many)
- for 1h, some people could have been randomly directed to the wrong
server when going to build.gluster.org


Root cause:
- for DNS, a wrong commit. The syntax looked correct (and was
verified), so I need to check why it did more than required.

- for the full disk, an increase in patches and an oversight in that
server's installation.


Resolution:
- the DNS change got reverted
- a new partition was added and the data was copied

What went well:
- we were quickly able to resolve the issue thanks to automation

Where we were lucky:
- the issue got detected fast by the same person who made the change
(DNS), and people (Kaleb) notified us as soon as something seemed weird
(disk)
- none of us were in Vancouver facing a measles outbreak

What went badly:
- still no monitoring


Potential improvements to make:
- add monitoring
- revise resource usage
- prepare a template for post mortem

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Announcing Gluster for Container Storage (GCS)

2018-08-23 Thread Michael Adam
On 2018-07-25 at 06:38 -0700, Vijay Bellur wrote:
> Hi all,

Hi Vijay,

Thanks for announcing this to the public and making everyone
more aware of Gluster's focus on container storage!

I would like to add an additional perspective to this,
giving some background about the history and origins:

Integrating Gluster with Kubernetes to provide
persistent storage for containerized applications is
not new. We have been working on this for more than
two years now, and it is used by many community users
and many customers (of Red Hat) in production.

The original software stack used heketi
(https://github.com/heketi/heketi) as a high-level service
interface for gluster to facilitate easy self-service
provisioning of volumes in kubernetes. In a separate, much
more narrowly scoped project, Heketi implemented some ideas
that were originally part of the glusterd2 plans, which got
us started with these efforts in the first place, and it also
went beyond those original ideas. These features are now being
merged into glusterd2, which will in the future replace heketi
in the container storage stack.
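
To make the heketi role concrete, here is a minimal sketch of a heketi-style
volume request. The endpoint URL is hypothetical and the payload fields are
only indicative of heketi's REST API (check the heketi docs for the
authoritative schema); JWT authentication is omitted for brevity.

#!/usr/bin/env python
# Minimal sketch: ask a heketi-like service to provision a replicated volume.
import json
import requests

HEKETI_URL = "http://heketi.example.com:8080"  # hypothetical endpoint

def create_volume(size_gb, replica=3):
    payload = {
        "size": size_gb,
        "durability": {"type": "replicate", "replicate": {"replica": replica}},
    }
    # heketi-style APIs answer asynchronously and point at an operation to
    # poll, so only the raw response is returned here.
    resp = requests.post(HEKETI_URL + "/volumes",
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp

if __name__ == "__main__":
    print(create_volume(10).status_code)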

We were also working on kubernetes itself, writing the
provisioners for various forms of gluster volumes in kubernetes
proper (https://github.com/kubernetes/kubernetes) and also in the
external-storage repo
(https://github.com/kubernetes-incubator/external-storage).
Those provisioners will eventually be replaced by the mentioned
CSI drivers. The expertise from the original kubernetes
development is now flowing into the CSI drivers.
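
The difference between the two provisioning paths can be sketched as two
StorageClass objects, created here with the kubernetes python client. The
in-tree provisioner name and its heketi "resturl" parameter follow the
upstream kubernetes documentation; the CSI driver name is a hypothetical
placeholder, not an official one.

#!/usr/bin/env python
# Sketch: legacy in-tree glusterfs provisioning vs. a CSI-based class.
from kubernetes import client, config

def make_storage_classes():
    config.load_kube_config()
    storage = client.StorageV1Api()

    # Legacy path: in-tree provisioner that talks to heketi over REST.
    legacy = client.V1StorageClass(
        metadata=client.V1ObjectMeta(name="glusterfs-heketi"),
        provisioner="kubernetes.io/glusterfs",
        parameters={"resturl": "http://heketi.example.com:8080"},
    )
    # New path: an external CSI driver (driver name is hypothetical).
    csi = client.V1StorageClass(
        metadata=client.V1ObjectMeta(name="glusterfs-csi"),
        provisioner="org.gluster.glusterfs-csi",
    )
    for sc in (legacy, csi):
        storage.create_storage_class(sc)

if __name__ == "__main__":
    make_storage_classes()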

The gluster-containers repository was already created and
used for this original container-storage effort.

The mentioned https://github.com/gluster/gluster-kubernetes
repository was not only the place for storing the deployment
artefacts and tools, but it was actually intended to be the
upstream home of the gluster-container-storage project.

In this view, I see the GCS project announced here
as GCS version 2. The first version (let me call it
version one), even though it was never officially announced
this widely in a formal introduction and never given a formal
release or version number, was the software stack described
above, homed at the gluster-kubernetes repository. If you look
at this project (and heketi), you see that it has a nice level
of popularity.

I think we should make use of this traction instead of
ignoring the legacy, and turn gluster-kubernetes into the
home of GCS (v2). In my view, GCS (v2) will be about:

* replacing some of the components with newer ones:
  - glusterd2 instead of the heketi and glusterd1 combo
  - CSI drivers (the new standard) instead of native
kubernetes plugins
* adding the operator feature
  (even though we are currently also working on an operator
  for the current stack with heketi and traditional gluster,
  since this will become important in production before
  this v2 is ready).
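
Whichever combination ends up behind the scenes (heketi plus glusterd1
today, glusterd2 plus CSI drivers in GCS v2), the consumption side stays an
ordinary PVC. A small sketch with the kubernetes python client follows; the
names and the storage class are illustrative only.

#!/usr/bin/env python
# Sketch: applications keep requesting gluster-backed storage via a PVC,
# independent of which management stack provisions it.
from kubernetes import client, config

def request_volume(namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="demo-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],      # gluster file volumes
            storage_class_name="glusterfs-csi",  # illustrative class name
            resources=client.V1ResourceRequirements(
                requests={"storage": "5Gi"}),
        ),
    )
    return core.create_namespaced_persistent_volume_claim(namespace, pvc)

if __name__ == "__main__":
    print(request_volume().metadata.name)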

These are my 2 cents on this topic.
I hope someone finds them useful.

I am very excited to (finally) see the broader gluster
community getting more aligned behind the idea of bringing
our great SDS system into the space of containers! :-)

Cheers - Michael





> We would like to let you  know that some of us have started focusing on an
> initiative called ‘Gluster for Container Storage’ (in short GCS). As of
> now, one can already use Gluster as storage for containers by making use of
> different projects available in github repositories associated with gluster
>  & Heketi .
> The goal of the GCS initiative is to provide an easier integration of these
> projects so that they can be consumed together as designed. We are
> primarily focused on integration with Kubernetes (k8s) through this
> initiative.
> 
> Key projects for GCS include:
> Glusterd2 (GD2)
> 
> Repo: https://github.com/gluster/glusterd2
> 
> The challenge we have with current management layer of Gluster (glusterd)
> is that it is not designed for a service oriented architecture. Heketi
> overcame this limitation and made Gluster consumable in k8s by providing
> all the necessary hooks needed for supporting Persistent Volume Claims.
> 
> Glusterd2 provides a service oriented architecture for volume & cluster
> management. Gd2 also intends to provide many of the Heketi functionalities
> needed by Kubernetes natively. Hence we are working on merging Heketi with
> gd2 and you can follow more of this action in the issues associated with
> the gd2 github repository.
> gluster-block
> 
> Repo: https://github.com/gluster/gluster-block
> 
> This project intends to expose files in a gluster volume as block devices.
> Gluster-block enables supporting ReadWriteOnce (RWO) PVCs and the
> corresponding workloads in Kubernetes using gluster as the underlying
> storage technology.
> 
> Gluster-block is intended to be consumed by stateful RWO applications like
> databases and k8s infrastructure services like logging, metrics

Re: [Gluster-devel] [Gluster-infra] 2 outages: http.int.rht.gluster.org disk full and DNS issue

2018-08-23 Thread Michael Scherer
On Thursday, 23 August 2018 at 17:12 +0200, Michael Scherer wrote:
> Hi,
> 
> quick note, we have 2 outages at the moment:
> 
> - I changed the build.gluster.org DNS, but somehow it redirects to
> supercolony.gluster and jenkins. Why, I am not sure, but I reverted my
> DNS change and will investigate more, since the syntax looked OK to me.
> 
> So we have to wait until the DNS is propagated. No workaround for now.
> 
> 
> - at the same time, Kaleb pointed out that http.int is full. It seems we
> didn't use a separate partition for that, so / filled up. I am working on
> this right now

So:
- the DNS change has been reverted until I figure it out; that seems to
have fixed what I was seeing.

- more disk space and a partition have been added to the server

I will write a post-mortem after my meeting.

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] 2 outages: http.int.rht.gluster.org disk full and DNS issue

2018-08-23 Thread Michael Scherer
Hi,

quick note, we have 2 outages at the moment:

- I changed the build.gluster.org DNS, but somehow it redirects to
supercolony.gluster and jenkins. Why, I am not sure, but I reverted my
DNS change and will investigate more, since the syntax looked OK to me.

So we have to wait until the DNS is propagated. No workaround for now.


- at the same time, Kaleb pointed out that http.int is full. It seems we
didn't use a separate partition for that, so / filled up. I am working on
this right now


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Coverity covscan for 2018-08-23-59e56024 (master branch)

2018-08-23 Thread staticanalysis


GlusterFS Coverity covscan results for the master branch are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2018-08-23-59e56024/

Coverity covscan results for other active branches are also available at
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Python components and test coverage

2018-08-23 Thread Nigel Babu
On Fri, Aug 10, 2018 at 5:59 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Fri, Aug 10, 2018 at 5:47 PM, Nigel Babu  wrote:
> > Hello folks,
> >
> > We're currently in a transition to python3. Right now, there's a bug in one
> > piece of this transition code. I saw Nithya run into this yesterday. The
> > challenge here is, none of our testing for the python2/python3 transition
> > catches this bug. Both Pylint and the ast-based testing that Kaleb
> > recommended do not catch this bug. The bug is trivial and would take 2
> > mins to fix; the challenge is that until we exercise almost all of these
> > code paths from both Python3 and Python2, we're not going to find out that
> > there are subtle breakages like this.
> >
>
> Where is this great reveal - what is this above-mentioned bug?
>
> > As far as I know, the three pieces where we use Python are geo-rep,
> > glusterfind, and libgfapi-python. My question:
> > * Are there more places where we run python?
> > * What sort of automated test coverage do we have for these components right
> > now?
> > * What can the CI team do to help identify problems? We have both Centos7
> > and Fedora28 builders, so we can definitely help run tests specific to
> > python.
>
>
The bugs I mentioned in this email are now fixed:
https://github.com/gluster/glusterfs/commit/bc61ee44a3f8a9bf0490605f62ec27fcd6a5b8d0

I still have no good answer as to what automated testing we have right now
and what the gaps are for both glusterfind and geo-rep. I'd like to know
the current state of automated testing and what is planned for the future
for both these bits of python code. Note that testing whether python2 code
is compliant with python3 has tooling; the reverse has no tooling (though
there's only 1y 4m left on the clock).
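
For readers wondering what such a runtime-only breakage looks like, here is
an illustrative example (not the actual glusterfs bug, which is in the
commit linked above): the syntax is valid for both interpreters, so pylint
and ast-based checks pass, but the runtime types differ.

#!/usr/bin/env python
# Illustrative py2/py3 pitfall: subprocess.check_output() returns str on
# python2 but bytes on python3, so mixing it with str only fails at runtime.
from __future__ import print_function
import subprocess

def list_bricks():
    out = subprocess.check_output(["echo", "brick1 brick2"])
    # Works on python2; raises TypeError on python3 until the output is
    # decoded, e.g. out = out.decode("utf-8").
    return out.strip().split(" ")

if __name__ == "__main__":
    print(list_bricks())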

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Emergency reboot on Gerrit today at 09h45 UTC

2018-08-23 Thread Michael Scherer
On Thursday, 23 August 2018 at 11:45 +0200, Michael Scherer wrote:
> On Thursday, 23 August 2018 at 10:46 +0200, Michael Scherer wrote:
> > Hi,
> > 
> > as said on https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , we
> > have found that gerrit can't sustain the load when receiving too many
> > patches at once. We are going to reboot the VM to bump the memory and
> > CPUs. We already used the hotplug feature of libvirt, so a reboot
> > is
> > required.
> > 
> > We are going to start that in 1h, so 
> > UTC 09h45
> > BER 11h45 
> > BLR 15h15
> > TLV 12h45
> > BOs 05h45
> > 
> > While the downtime is supposed to be minimal (< 5 minutes), years of
> > experience in this profession have taught me that things do not always
> > go well, so we are allotting a 2h window.
> > 
> > We will send a follow-up email at the start of the maintenance and at
> > the end. We will also remind people on IRC.
> 
> It is starting now !

We did the memory upgrade, started gerrit, and did the usual checks:
- the web interface works
- gerrit lets us connect
- jenkins seems to be able to send feedback on patches.

So the maintenance window is finished; do not hesitate to contact us if
there is any issue.

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Emergency reboot on Gerrit today at 09h45 UTC

2018-08-23 Thread Michael Scherer
On Thursday, 23 August 2018 at 10:46 +0200, Michael Scherer wrote:
> Hi,
> 
> as said on https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , we
> have found that gerrit can't sustain the load when receiving too many
> patches at once. We are going to reboot the VM to bump the memory and
> CPUs. We already used the hotplug feature of libvirt, so a reboot is
> required.
> 
> We are going to start that in 1h, so 
> UTC 09h45
> BER 11h45 
> BLR 15h15
> TLV 12h45
> BOs 05h45
> 
> While the downtime is supposed to be minimal (< 5 minutes), years of
> experience in this profession have taught me that things do not always go
> well, so we are allotting a 2h window.
> 
> We will send a follow-up email at the start of the maintenance and at the
> end. We will also remind people on IRC.

It is starting now !


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Wireshark dissectors for Gluster 4.0

2018-08-23 Thread Amar Tumballi
Thanks for this, Poornima and Niels.

The patch is now merged in the wireshark project. If one of you could post
here once the release is made there, it would be great so that Gluster
developers can upgrade.

-Amar

On Fri, Jul 27, 2018 at 2:26 PM, Amar Tumballi  wrote:

> Took a look at the patch!
>
> Looks like you have covered every FOP, other than compound? (which is not
> used anywhere now). Is there anything specific which you missed, where you
> need help to complete it?
>
>
>
>
>
> On Fri, Jul 27, 2018 at 11:32 AM, Poornima Gurusiddaiah <
> pguru...@redhat.com> wrote:
>
>> Hi,
>>
>> Here is the patch for dissecting the Gluster 4.0 protocol in wireshark [1].
>> The initial tests for fops seem to be working. I request you all to add any
>> missing fops and fix/report any issues in decoding.
>>
>> [1] https://code.wireshark.org/review/#/c/28871
>>
>> Regards,
>> Poornima
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Amar Tumballi (amarts)
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Reboot policy for the infra

2018-08-23 Thread Michael Scherer
On Thursday, 23 August 2018 at 11:37 +0300, Yaniv Kaul wrote:
> On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer wrote:
> 
> > On Thursday, 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> > > One more piece that's missing is when we'll restart the physical
> > > servers.
> > > That seems to be entirely missing. The rest looks good to me and
> > > I'm
> > > happy
> > > to add an item to next sprint to automate the node rebooting.
> > 
> > That's covered by "as critical as the services that depend on them".
> > 
> > Now, the problem I do have is that some servers (myrmicinae, to name
> > one) take 30 minutes to reboot, and I can't diagnose or fix that
> > without it taking hours. This is the one running gerrit/jenkins, so
> > it's not possible to spend time on this kind of test.
> > 
> 
> You'd imagine people would move to kexec reboots for VMs by now.
> Not sure why it's not catching on.
> (BTW, is it taking time to shut down or to bring up?)
> Y.

To bring up, according to my notes.

And I am not sure how kexec would work with microcode updates. We also
need to upgrade the BIOS sometime :/

> 
> > 
> > 
> > 
> > > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer wrote:
> > > 
> > > > Hi,
> > > > 
> > > > so that's kernel reboot time again, this time courtesy of Intel
> > > > (again). I do not consider the issue to be "OMG the sky is
> > > > falling",
> > > > but enough to take time to streamline our process to reboot.
> > > > 
> > > > 
> > > > 
> > > > Currently, we do not have a policy or anything, and I think the
> > > > negotiation time around that is cumbersome:
> > > > - we need to reach people, which takes time and adds latency
> > > > (it would be bad if it were an urgent issue, and it likely adds
> > > > undue stress while waiting)
> > > > 
> > > > - we need to keep track of what was supposed to be done, which
> > > > is
> > > > also
> > > > cumbersome
> > > > 
> > > > While that's not a problem if I had only gluster to deal with,
> > > > my team of 3 has to deal with a few more projects than 1, and
> > > > orchestrating choices for a dozen groups is time-consuming (just
> > > > think of the last time you had to go to a restaurant after a
> > > > conference to see how hard it is to reach agreements).
> > > > 
> > > > So I would propose that we simplify that with the following
> > > > policy:
> > > > 
> > > > - Jenkins builders would be rebooted by jenkins on a regular
> > > > basis. I do not know how we can do that, but given that we have
> > > > enough nodes to sustain builds, it shouldn't impact developers in
> > > > a big way. The only exception is the freebsd builder, since we
> > > > only have 1 functional at the moment. But once the 2nd one is
> > > > working, it should be treated like the others.
> > > > 
> > > > - services in HA (firewall, reverse proxy, internal squid/DNS)
> > > > would be rebooted during the day without notice. Due to working
> > > > HA, that's not user impacting. In fact, that's already what I do.
> > > >
> > > > - services not in HA should be pushed towards HA (gerrit might
> > > > get there one day, no way for jenkins :/; we need to see about
> > > > postgres and fstat/softserve, and maybe try to get something for
> > > > download.gluster.org)
> > > >
> > > > - critical services not in HA should be announced in advance.
> > > > Critical means the services listed here:
> > > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > > >
> > > > - services not visible to end users (backup servers, ansible
> > > > deployment, etc.) can be rebooted at will
> > > > 
> > > > Then the only question is what about stuff not in the previous
> > > > category, like softserve, fstat.
> > > > 
> > > > Also, all dependencies are as critical as the most critical
> > > > service that depends on them. So the hypervisors hosting
> > > > gerrit/jenkins are critical (until we find a way to avoid
> > > > outages), the ones for builders are not.
> > > > 
> > > > 
> > > > 
> > > > Thoughts, ideas ?
> > > > 
> > > > 
> > > > --
> > > > Michael Scherer
> > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > > 
> > > > ___
> > > > Gluster-infra mailing list
> > > > gluster-in...@gluster.org
> > > > https://lists.gluster.org/mailman/listinfo/gluster-infra
> > > 
> > > 
> > > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
> > 
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-devel
> > 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
__

[Gluster-devel] Emergency reboot on Gerrit today at 09h45 UTC

2018-08-23 Thread Michael Scherer
Hi,

as said on https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , we
have found that gerrit can't sustain the load when receiving too many
patches at once. We are going to reboot the VM to bump the memory and
CPUs. We already used the hotplug feature of libvirt, so a reboot is
required.

We are going to start that in 1h, so 
UTC 09h45
BER 11h45 
BLR 15h15
TLV 12h45
BOs 05h45

While the downtime is supposed to be minimal (< 5 minutes), years of
experience in this profession have taught me that things do not always go
well, so we are allotting a 2h window.

We will send a follow-up email at the start of the maintenance and at the
end. We will also remind people on IRC.

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Reboot policy for the infra

2018-08-23 Thread Yaniv Kaul
On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer wrote:

> On Thursday, 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> > One more piece that's missing is when we'll restart the physical
> > servers.
> > That seems to be entirely missing. The rest looks good to me and I'm
> > happy
> > to add an item to next sprint to automate the node rebooting.
>
> That's covered by "as critical as the services that depend on them".
>
> Now, the problem I do have is that some servers (myrmicinae, to name one)
> take 30 minutes to reboot, and I can't diagnose or fix that without it
> taking hours. This is the one running gerrit/jenkins, so it's not
> possible to spend time on this kind of test.
>

You'd imagine people would move to kexec reboots for VMs by now.
Not sure why it's not catching on.
(BTW, is it taking time to shut down or to bring up?)
Y.


>
>
>
> > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer wrote:
> >
> > > Hi,
> > >
> > > so that's kernel reboot time again, this time courtesy of Intel
> > > (again). I do not consider the issue to be "OMG the sky is
> > > falling",
> > > but enough to take time to streamline our process to reboot.
> > >
> > >
> > >
> > > Currently, we do not have a policy or anything, and I think the
> > > negotiation time around that is cumbersome:
> > > - we need to reach people, which takes time and adds latency (it
> > > would be bad if it were an urgent issue, and it likely adds undue
> > > stress while waiting)
> > >
> > > - we need to keep track of what was supposed to be done, which is
> > > also
> > > cumbersome
> > >
> > > While that's not a problem if I had only gluster to deal with, my
> > > team of 3 has to deal with a few more projects than 1, and
> > > orchestrating choices for a dozen groups is time-consuming (just
> > > think of the last time you had to go to a restaurant after a
> > > conference to see how hard it is to reach agreements).
> > >
> > > So I would propose that we simplify that with the following policy:
> > >
> > > - Jenkins builders would be rebooted by jenkins on a regular basis.
> > > I do not know how we can do that, but given that we have enough
> > > nodes to sustain builds, it shouldn't impact developers in a big
> > > way. The only exception is the freebsd builder, since we only have
> > > 1 functional at the moment. But once the 2nd one is working, it
> > > should be treated like the others.
> > >
> > > - services in HA (firewall, reverse proxy, internal squid/DNS)
> > > would be rebooted during the day without notice. Due to working HA,
> > > that's not user impacting. In fact, that's already what I do.
> > >
> > > - services not in HA should be pushed towards HA (gerrit might get
> > > there one day, no way for jenkins :/; we need to see about postgres
> > > and fstat/softserve, and maybe try to get something for
> > > download.gluster.org)
> > >
> > > - critical services not in HA should be announced in advance.
> > > Critical means the services listed here:
> > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > >
> > > - services not visible to end users (backup servers, ansible
> > > deployment, etc.) can be rebooted at will
> > >
> > > Then the only question is what about stuff not in the previous
> > > category, like softserve, fstat.
> > >
> > > Also, all dependencies are as critical as the most critical
> > > service that depends on them. So the hypervisors hosting
> > > gerrit/jenkins are critical (until we find a way to avoid outages),
> > > the ones for builders are not.
> > >
> > >
> > >
> > > Thoughts, ideas ?
> > >
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure and Platform, OSAS
> > >
> > > ___
> > > Gluster-infra mailing list
> > > gluster-in...@gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-infra
> >
> >
> >
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Urgent Gerrit reboot today

2018-08-23 Thread Nigel Babu
Hello folks,

We're going to do an urgent reboot of the Gerrit server in the next 1h or
so. For some reason, hot-adding RAM on this machine isn't working, so we're
going to do a reboot to get this working. This is needed to prevent the OOM
Kill problems we've been running into since last night.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Reboot policy for the infra

2018-08-23 Thread Michael Scherer
On Thursday, 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> One more piece that's missing is when we'll restart the physical
> servers.
> That seems to be entirely missing. The rest looks good to me and I'm
> happy
> to add an item to next sprint to automate the node rebooting.

That's covered by "as critical as the services that depend on them".

Now, the problem I do have is that some servers (myrmicinae, to name one)
take 30 minutes to reboot, and I can't diagnose or fix that without it
taking hours. This is the one running gerrit/jenkins, so it's not
possible to spend time on this kind of test.



> On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer wrote:
> 
> > Hi,
> > 
> > so that's kernel reboot time again, this time courtesy of Intel
> > (again). I do not consider the issue to be "OMG the sky is
> > falling",
> > but enough to take time to streamline our process to reboot.
> > 
> > 
> > 
> > Currently, we do not have a policy or anything, and I think the
> > negotiation time around that is cumbersome:
> > - we need to reach people, which takes time and adds latency (it
> > would be bad if it were an urgent issue, and it likely adds undue
> > stress while waiting)
> > 
> > - we need to keep track of what was supposed to be done, which is
> > also
> > cumbersome
> > 
> > While that's not a problem if I had only gluster to deal with, my
> > team of 3 has to deal with a few more projects than 1, and
> > orchestrating choices for a dozen groups is time-consuming (just
> > think of the last time you had to go to a restaurant after a
> > conference to see how hard it is to reach agreements).
> > 
> > So I would propose that we simplify that with the following policy:
> > 
> > - Jenkins builders would be rebooted by jenkins on a regular basis.
> > I do not know how we can do that, but given that we have enough
> > nodes to sustain builds, it shouldn't impact developers in a big way.
> > The only exception is the freebsd builder, since we only have 1
> > functional at the moment. But once the 2nd one is working, it should
> > be treated like the others.
> > 
> > - services in HA (firewall, reverse proxy, internal squid/DNS) would
> > be rebooted during the day without notice. Due to working HA, that's
> > not user impacting. In fact, that's already what I do.
> > 
> > - services not in HA should be pushed towards HA (gerrit might get
> > there one day, no way for jenkins :/; we need to see about postgres
> > and fstat/softserve, and maybe try to get something for
> > download.gluster.org)
> > 
> > - critical services not in HA should be announced in advance.
> > Critical means the services listed here:
> > https://gluster-infra-docs.readthedocs.io/emergency.html
> > 
> > - services not visible to end users (backup servers, ansible
> > deployment, etc.) can be rebooted at will
> > 
> > Then the only question is what about stuff not in the previous
> > category, like softserve, fstat.
> > 
> > Also, all dependencies are as critical as the most critical service
> > that depends on them. So the hypervisors hosting gerrit/jenkins are
> > critical (until we find a way to avoid outages), the ones for
> > builders are not.
> > 
> > 
> > 
> > Thoughts, ideas ?
> > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
> > ___
> > Gluster-infra mailing list
> > gluster-in...@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-infra
> 
> 
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel