Re: [Gluster-infra] Jenkins switched over to new builders for regression

2019-02-08 Thread Nigel Babu
All the RAX builders are now gone. We're running off AWS entirely now.
Please file an infra bug if you notice something odd. For future reference,
logs and cores are going to be available on https://logs.aws.gluster.org
rather than individual build servers. This should, in the future, be
printed in the logs.

On Fri, Feb 8, 2019 at 7:49 AM Nigel Babu  wrote:

> Hello,
>
> We've reached the halfway mark in the migration and half of our builders
> are now running on AWS as of today. I've turned off the RAX builders; they
> will only come online if the AWS builders cannot handle the number of jobs
> running at any given point.
>
> The new builders are named builder2xx.aws.gluster.org. If you notice an
> infra issue with them, please file a bug. I will be working on adding more
> AWS builders during the day today.
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins switched over to new builders for regression

2019-02-07 Thread Nigel Babu
Hello,

We've reached the halfway mark in the migration and half of our builders
are now running on AWS as of today. I've turned off the RAX builders; they
will only come online if the AWS builders cannot handle the number of jobs
running at any given point.

The new builders are named builder2xx.aws.gluster.org. If you notice an
infra issue with them, please file a bug. I will be working on adding more
AWS builders during the day today.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Regression logs issue

2019-02-07 Thread Nigel Babu
Hello folks,

In the last week, if you had a regression job that failed, you will not
find a log for it. This is due to a mistake I made while deleting code.
While removing the code that pushed logs to an internal HTTP server, I
also deleted a line which handled the log creation. Apologies for the
mistake. This has now been corrected and the fix pushed to all regression
nodes. Any future failures should have logs attached as artifacts.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Please do not upgrade the cppcheck Jenkins plugin

2019-01-10 Thread Nigel Babu
Hello folks,

This is a note to myself and everyone else. Please do not upgrade cppcheck
past 1.22. The plugin seems to have changed in a backwards-incompatible
manner. For now we'll stick to the 1.22 version until we figure out how to
make it work with the latest version.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Infra Update for Nov and Dec

2018-12-19 Thread Nigel Babu
Hello folks,

The infra team has not been sending regular updates recently because we’ve
been caught up in several different pieces of work that were running longer
than our 2-week sprint cycles. This is a summary of what we’ve done so far
since the last update.

* The bugzilla updates are done with a python script now, and there’s now a
patch to handle a change being abandoned and restored. It’s pending a merge
and deploy after the holiday season.
* We added smoke jobs for python linting and shell linting.
* We added smoke jobs for 32-bit builds.

The big piece the infra team has been spending time on is identifying the
best way to write end-to-end testing for GCS (Gluster for Container
Storage). We started with the assumption that we want to use a test
framework that sticks as closely as possible to the upstream kubernetes and
Openshift Origin tests. We have had a 3-pronged approach to this over the
last two months.

1. We want to use machines we have access to right now to verify that the
deployment scripts we publish work as we intend them to. To this end, we
created a job on Centos CI that runs the deployment exactly the way we
recommend anyone run the scripts in the gcs repository[1]. We’re running
into a couple of failures and Mrugesh is working on identifying and fixing
them. We hope to have this complete in the first week of January.
2. We want to use the upstream end-to-end test framework that consumes
ginkgo and gomega. The framework already exists to consume the kubectl
client to talk to a kubernetes cluster. We had a conversation with the
upstream Storage SIG developers yesterday that pointed us in the right
direction. We’re very close to having a first test. When the first test in
the end-to-end framework comes about, we’ll hook it up to the test run we
have in (1). Deepshikha and I are actively working on making this happen.
We plan to have a proof of concept in the second week of January and write
documentation and demos for the GCS team.
3. We want to do some testing that actively tries to break a
production-sized cluster and observe how our stack handles failures.
There's a longer plan for how to do this, but the work is currently on hold
until we get the first two pieces running. It is also blocked on us having
access to infrastructure where we can make this happen. Mrugesh will lead
this activity once the other blockers are removed.

Once we have the first proof of concept test written, we will hand over
writing the tests to the GCS development team and the infra team will then
move to working on building out the infrastructure for running these new
tests. We will continue to work in close collaboration with the Kubernetes
Storage SIG and the OKD Infrastructure teams to prevent us from duplicating
work.

[1]: https://ci.centos.org/view/Gluster/job/gluster_anteater_gcs/


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Github notifications for spec reviews

2018-11-30 Thread Nigel Babu
Hello folks,

We've had a bug[1] asking us to automate adding a comment on Github when
there is a new spec patch. I'm going to deny this request.

* The glusterfs-specs repo does not have an issue tracker and does not seem
to ever need an issue tracker. We currently limit pre-merge commenting on
Github to the repo the commit belongs to. That is, if a glusterfs patch
references an issue on glusterd2, we will not comment on the GD2 issue.
This is done to prevent noise on external repos. The commit only becomes
relevant to the external repo when it's merged.
* A spec is only a proposal for a change. The spec only makes sense when
it's been reviewed and merged. At the point when it's merged, Github will
pick it up anyway.

For these reasons, I'm going to close this bug as WONTFIX. If you feel
strongly about it, please let me know potential use cases for this behavior.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1557127

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Short review.gluster.org outage in the next 15 mins

2018-11-05 Thread Nigel Babu
Hello folks,

Going to restart gerrit on review.gluster.org for a quick config change in
the next 15 mins. Estimated outage of 5 mins. I'll update this thread when
we're back online.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Centos CI automation Retrospective

2018-11-02 Thread Nigel Babu
Oops, missed finishing a line.

Please avoid making any changes directly via the Jenkins UI going forward.
Any configuration changes need to be made from the repo so the config
drives Jenkins.

On Fri, Nov 2, 2018 at 11:32 AM Nigel Babu  wrote:

> Hello folks,
>
> On Monday, I merged in the changes that allowed all the jobs in Centos CI
> to be handled in an automated fashion. In the past, it depended on Infra
> team members to review, merge, and apply the changes on Centos CI. I've now
> changed that so that the individual job owners can do their own merges.
>
> 1. On sending a pull request, a travis-ci job will ensure the YAML is
> valid JJB.
> 2. On merge, we'll apply the changes to ci.centos.org with travis-ci.
>
> We had a few issues when we did this change. This was expected, but it
> took more time than I anticipated to fix all of them up.
>
> Notably, the GD2 CI issues did not get fixed up until today. This was
> because the status context was not defined in the yaml file, but only on
> the UI. However, I can now confirm that all jobs are working exactly off
> their source yaml. Thanks to Kaushal and Madhu for working with me on
> solving this issue. Apologies for the inconvenience caused. If you have a
> pull request that did not seem to get CI to work, please send an update
> with a cosmetic change. That should retrigger CI correctly.
>
> If you notice anything off, please file an infra bug and we'll be happy
> to help.
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Centos CI automation Retrospective

2018-11-02 Thread Nigel Babu
Hello folks,

On Monday, I merged in the changes that allowed all the jobs in Centos CI
to be handled in an automated fashion. In the past, it depended on Infra
team members to review, merge, and apply the changes on Centos CI. I've now
changed that so that the individual job owners can do their own merges.

1. On sending a pull request, a travis-ci job will ensure the YAML is valid
JJB.
2. On merge, we'll apply the changes to ci.centos.org with travis-ci.
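
If you want to sanity-check your change locally before sending a pull
request, jenkins-job-builder can render the jobs without touching any
server. A minimal sketch, assuming the YAML sits in a jobs/ directory
(the exact layout of the centosci repo may differ):

  $ pip install --user jenkins-job-builder
  $ jenkins-jobs test jobs/ -o /tmp/rendered-jobs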

We had a few issues when we did this change. This was expected, but it took
more time than I anticipated to fix all of them up.

Notably, the GD2 CI issues did not get fixed up until today. This was
because the status context was not defined in the yaml file, but only on
the UI. However, I can now confirm that all jobs are working exactly off
their source yaml. Thanks to Kaushal and Madhu for working with me on
solving this issue. Apologies for the inconvenience caused.
If you have a pull request that did not seem to get CI to work, please send
an update with a cosmetic change. That should retrigger CI correctly.

If you notice anything off, please file an infra bug and we'll be happy to
help.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Maintaining gluster/centosci repo

2018-10-29 Thread Nigel Babu
On Mon, Oct 29, 2018 at 3:00 PM Michael Adam  wrote:

>
>
> On Mon, Oct 29, 2018 at 10:09 AM Nigel Babu  wrote:
>
>> This patch was merged today.
>>
>
> Sorry, but what does "This" refer to?
>
>
The .travis.yml change with a test and deployment script.

https://github.com/gluster/centosci/pull/27

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Gluster Infra Update

2018-10-18 Thread Nigel Babu
Hello folks,

Here's the update from the last 2 weeks from the Infra team.

* Created an architecture document for Automated Upgrade Testing. This is
now done and is undergoing reviews. It is scheduled to be published on the
devel list as soon as we have a decent PoC.
* Finished part of the migration of the bugzilla handling scripts to
python[1]. Sanju discovered a bug[2], so it's been rolled back. We're going
to add the ability to handle an external tracker as well while we fix the
bug.
* Softserve's SSH key handling is better[3]. You no longer have to paste an
SSH key into softserve as long as you have that key on Github. Softserve
will pick up the key from Github and auto-populate that field for you.
* Thanks to Sheersha's work we have a CI job[4] for gluster-ansible-infra
now.
* We're decentralizing the responsibility for handling Centos CI jobs[5].

[1]:
https://github.com/gluster/glusterfs-patch-acceptance-tests/blob/master/github/handle_bugzilla.py
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1636455
[3]: https://github.com/gluster/softserve/pull/48
[4]: https://github.com/gluster/centosci/pull/23
[5]:
https://lists.gluster.org/pipermail/gluster-infra/2018-October/005155.html

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Reducing the number of builders in the cage

2018-10-15 Thread Nigel Babu
I think it might be worth pulling out some utilization numbers to see how
many to pull. If we can get the freebsd builder working, that would
eliminate the need to run it on Rackspace, and having two of them would
increase the speed at which we process the smoke queue.

On Mon, Oct 15, 2018 at 5:32 PM Michael Scherer  wrote:

> Le lundi 15 octobre 2018 à 15:29 +0530, Sankarshan Mukhopadhyay a
> écrit :
> > On Mon, Oct 15, 2018 at 3:19 PM Michael Scherer 
> > wrote:
> >
> > > so we currently have 50 builders in the cage, and I think that's too
> > > many. While that's not a huge issue, having too many VMs does cause
> > > slowdowns in ansible and consumes resources for nothing (disk space,
> > > CPU, bandwidth). When I look at the graphs on
> > > https://munin.gluster.org/ , there isn't much load.
> > >
> > > So I would like to start removing some of them and see how that goes.
> >
> > What is the value of this "some of them"? 10%, 20% ...
>
> It depends on the builders, but I think I would remove 10% for a start,
> so around 4.
>
> Now, not all builders are equal; I am not gonna remove the debian nor
> the freebsd one, of course.
>
> I think the easiest would be to mark them offline in jenkins, see if
> that creates issues (such as the queue becoming large), and if nothing
> happens, remove them.
>
> AFAIK, those are mostly here for smoke tests, format, warnings, etc.
> So fast jobs.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Maintaining gluster/centosci repo

2018-10-12 Thread Nigel Babu
Hello folks,

The centosci repo keeps falling behind in terms of reviews and merges, plus
delays in applying the merges on ci.centos.org. I'd like to propose the
following to change that. This change will impact everyone who runs a job
on Centos CI.

* As soon as you merge a patch into that repo, we will apply that patch on
Centos CI using Jenkins/Travis (don't really care which one).
* Every team that has a job will have at least one committer (preferably
more than one). Please feel free to review and merge patches as long as
they only apply to your job. If you want to add new committers, please
file a bug.
* If you need to create a new job, you can ask us for initial review, but
the rest can be handled by your team independently.
* If you want an old job deleted, please file a bug.

Does this sound acceptable? I'm going to deploy a CI job to apply master on
Centos CI on the 29th. Please nominate folks from your teams who need
explicit commit access. The first day might be choppy in case there's a
diff between what's on ci.centos.org vs what's in the repo.


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Infra Update for the last 2 weeks

2018-10-03 Thread Nigel Babu
Hello folks,

I meant to send this out on Monday, but it's been a busy few days.
* The infra pieces of distributed regression are now complete. A big shout
out to Deepshikha for driving this and to Ramky for his help in getting
this to completion.
* The GD2 container and CSI container builds work now. We still don't know
why they broke or why they started working again. We're tracking this in a
bug[1].
* Gluster-Infra now has a Sentry.io account, so we discover issues with
softserve or fstat very quickly and are able to debug them quickly.
* We're restarting our efforts to get a nightly Glusto job going and are
running into test failures. We are currently debugging them to separate
actual failures from infra issues.
* The infra team has been assisting gluster-ansible on and off to help them
build out a set of tests. This has been going steady and is now waiting on
the infra team to set up CI with the Centos CI team.
* From this sprint on, we're going to be spending some time triaging the
infra bugs so they're assigned and in the correct state.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1626453

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Repository needed - cockpit-gluster

2018-10-02 Thread Nigel Babu
If your github handle is rohantmp, you should already have access. If that
isn't your handle, please give me your handle. For future reference, please
file a bug for requests like this.

On Thu, Sep 27, 2018 at 5:15 PM Rohan Joseph  wrote:

> Hi!
>
> Can I get access to change the description?
>
> On Mon, Aug 20, 2018 at 4:21 PM, Nigel Babu  wrote:
>
>> Yep. Done.
>>
>> On Mon, Aug 20, 2018 at 4:05 PM Sahina Bose  wrote:
>>
>>> Hi Nigel,
>>>
>>> I've raised a bug for repository creation -
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1619205
>>>
>>> Could you help?
>>>
>>> thanks
>>> sahina
>>>
>>
>>
>> --
>> nigelb
>>
>
>

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Unplanned Jenkins maintenance

2018-09-28 Thread Nigel Babu
Hello folks,

I did a quick unplanned Jenkins maintenance today to upgrade 3 plugins with
security issues in them. This is now complete. There was a brief period
where we did not start new jobs until Jenkins restarted. There should have
been no interruption to existing jobs and no jobs should have been
canceled. Please file a bug if you notice something wrong post-upgrade.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Freebsd builder upgrade to 10.4, maybe 11

2018-09-11 Thread Nigel Babu
On Tue, Sep 11, 2018 at 7:06 PM Michael Scherer  wrote:

> And... rescue mode is not working. So the server is down until
> Rackspace fixes it.
>
> Can someone disable the freebsd smoke test, as I think our 2nd builder
> is not yet building fine?
>


Disabled. Please do not merge any JJB review requests until this is fixed.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Urgent Gerrit reboot today

2018-08-23 Thread Nigel Babu
Hello folks,

We're going to do an urgent reboot of the Gerrit server in the next hour or
so. For some reason, hot-adding RAM to this machine isn't working, so we're
going to do a reboot to get the additional RAM recognized. This is needed
to prevent the OOM kill problems we've been running into since last night.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Reboot policy for the infra

2018-08-22 Thread Nigel Babu
One more piece that's missing is when we'll restart the physical servers;
that isn't covered at all. The rest looks good to me and I'm happy to add
an item to next sprint to automate the node rebooting.

On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer  wrote:

> Hi,
>
> so that's kernel reboot time again, this time courtesy of Intel
> (again). I do not consider the issue to be "OMG the sky is falling",
> but enough to take time to streamline our process to reboot.
>
>
>
> Currently, we do not have a policy or anything, and I think the
> negotiation time around that is cumbersome:
> - we need to reach people, which takes time and adds latency (would be
> bad if it was an urgent issue, and likely adds unneeded stress while
> waiting)
>
> - we need to keep track of what was supposed to be done, which is also
> cumbersome
>
> While that would not be a problem if I had only gluster to deal with, my
> team of 3 has to deal with a few more projects than one, and
> orchestrating choices for a dozen groups is time consuming (just think of
> the last time you had to go to a restaurant after a conference to see how
> hard it is to reach agreement).
>
> So I would propose that we simplify that with the following policy:
>
> - Jenkins builders would be rebooted by jenkins on a regular basis.
> I do not know how we can do that, but given that we have enough nodes to
> sustain builds, it shouldn't impact developers in a big way. The only
> exception is the freebsd builder, since we only have 1 functional at
> the moment. But once the 2nd is working, it should be treated like the
> others.
>
> - services in HA (firewall, reverse proxy, internal squid/DNS) would be
> rebooted during the day without notice. Due to working HA, that's not
> user impacting. In fact, that's already what I do.
>
> - services not in HA should be pushed towards HA (gerrit might get there
> one day, no way for jenkins :/, need to see for postgres and so
> fstat/softserve, and maybe try to get something for
> download.gluster.org)
>
> - services critical and not in HA should be announced in advance.
> Critical means the services listed here:
> https://gluster-infra-docs.readthedocs.io/emergency.html
>
> - services not visible to end users (backup servers, ansible deployment
> etc) can be rebooted at will
>
> Then the only question is what about stuff not in the previous
> category, like softserve, fstat.
>
> Also, all dependencies are as critical as the most critical service
> that depends on them. So hypervisors hosting gerrit/jenkins are critical
> (until we find a way to avoid outages); the ones for builders are not.
>
>
>
> Thoughts, ideas ?
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Repository needed - cockpit-gluster

2018-08-20 Thread Nigel Babu
Yep. Done.

On Mon, Aug 20, 2018 at 4:05 PM Sahina Bose  wrote:

> Hi Nigel,
>
> I've raised a bug for repository creation -
> https://bugzilla.redhat.com/show_bug.cgi?id=1619205
>
> Could you help?
>
> thanks
> sahina
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Portmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Nigel Babu
On Wed, Aug 15, 2018 at 2:41 PM Michael Scherer  wrote:

> Hi folks,
>
> So the Gluster jenkins disk was full today (because outages do not
> respect public holidays in India (Independence Day) and France
> (Assumption)); here is the post mortem for your reading pleasure.
>
> Date: 15/08/2018
>
> Service affected:
>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>
> Impact:
>
>   No jenkins job could be triggered.
>
> Root cause:
>
>   A full disk, mainly because we got new jobs and more patches, so
> regular growth.
>
> Resolution:
>
>   Increased the disk by 30G, and investigating if cleanup could be
>   improved. This did require a reboot.
>
>
> Involved people:
> - misc
> - nigel
>
> Lessons learned
> - What went well:
>   - we had a documented process for that, and it was good enough to be
> used by a tired admin.
>
> - What went bad:
>   - we weren't proactive enough to see this before it caused an outage
>   - the 15th of August is a holiday for both France and India.
> Technically, none of the infra team should have been up.
>
> - When we were lucky
>   - It was a day off in India, so few people were affected, except
> folks who continue to work on days off
>   - Misc decided to go to work while being in Brno to take days off
> later
>
>
> Timeline (in UTC)
>
> - 05:58 Amar posts a mail to say "smoke job fail" on gluster-infra:
> https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
>
> - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
> away from his laptop for the Independence Day celebration.
>
> - 06:24 Misc does not hear the ding since he is asleep
>
> - 06:55 Sankarshan opens a bug on it:
> https://bugzilla.redhat.com/show_bug.cgi?id=1616160
>
> - 06:56 Misc does not see the email since he is still asleep
>
> - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
> closing his eyes again. He looks at it, and starts to swear.
>
> - 07:14 Investigation reveals that the Jenkins partition is full (100%).
> A quick investigation does not yield any particular issues. The Jenkins
> jobs are taking space and that's it.
>
> - 07:19 After discussion with Nigel, it is decided to increase the size
> of the partition. Misc takes a look at it and tries to increase it
> without any luck. The server is rebooted in case that's what was needed.
> Still not enough.
>
> - 07:25 Misc goes for a quick shower to wake up. The warm embrace of
> water makes him remember that documentation on that process exists:
>
> https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
>
> - 07:30 Following the documentation, we discover that the hypervisor
> is now out of space for future increases. Looking at that will be done
> after the post mortem.
>
> - 07:37 Jenkins is restarted, with more space, and seems to work ok.
>
> - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
>
> - 09:09 Post mortem is finished and being sent
>
>
> Action items:
> - (misc) see what can be done for myrmicinae (the hypervisor where
> jenkins is running) since there is no more space.
>
> Potential improvements to make:
> - we still need to have monitoring in place
> - we need to move munin to the internal LAN to look at the graphs for
> jenkins
> - documentation regarding resizing could be clearer, notably on the
> volume resizing part
>

This highlights that we need to solve
https://bugzilla.redhat.com/show_bug.cgi?id=1564372 as a priority. The lack
of monitoring is affecting day-to-day work.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread Nigel Babu
This is something I've highlighted in the past. If you trigger regression
and smoke at the same time, smoke will only vote after the regression job
is done. That's Jenkins optimizing its communication with Gerrit so that it
needs to do the voting only once. This is a feature and not a bug.

On Wed, Aug 15, 2018 at 11:32 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> The job triggered for me but not flagged +1
>
> Reference: https://review.gluster.org/#/c/glusterfs/+/20548/
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Setting up machines from softserve in under 5 mins

2018-08-13 Thread Nigel Babu
Hello folks,

A while ago, Deepshikha did the work to make it faster to loan a machine
and run your regressions on it. I've tested this a few times today to
confirm it works as expected. In the past, Softserve[1] machines would be a
clean Centos 7 image. Now, we have an image with all the dependencies
installed and *almost* set up to run regressions. It just needs a few steps
run on it, and we have a simplified playbook that will run *just* those
steps. This brings the time to set up a machine down from around 30 mins to
less than 5 mins. The instructions[2] are on the softserve wiki for now,
but will move to the site itself in the future.
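
Roughly, the final step looks like this (a sketch with placeholder host
and playbook names; the exact repository, playbook, and inventory details
are in the wiki instructions[2]):

  # run the regression-setup playbook against the loaned machine
  # (hostname and playbook name below are placeholders)
  $ ansible-playbook -i loaned-builder.example.gluster.org, \
      setup-regression-machine.yml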

Please let us know if you run into any trouble by filing a bug[3].
[1]: https://softserve.gluster.org/
[2]:
https://github.com/gluster/softserve/wiki/Running-Regressions-on-loaned-Softserve-instances
[3]:
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS=project-infrastructure

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Post-upgrade issues

2018-08-08 Thread Nigel Babu
Hello folks,

We have two post-upgrade issues

1. Jenkins jobs are failing because git clones fail. This is now fixed.
2. git.gluster.org shows no repos at the moment. I'm currently debugging
this.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Fwd: Gerrit downtime on Aug 8, 2018

2018-08-08 Thread Nigel Babu
On Wed, Aug 8, 2018 at 4:59 PM Yaniv Kaul  wrote:

>
> Nice, thanks!
> I'm trying out the new UI. Needs getting used to, I guess.
> Have we upgraded to NotesDB?
>

Yep! Account information is now completely in NoteDB and not in ReviewDB
(which is backed by PostgreSQL for us) anymore.
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Fwd: Gerrit downtime on Aug 8, 2018

2018-08-07 Thread Nigel Babu
Reminder, this upgrade is tomorrow.

-- Forwarded message -
From: Nigel Babu 
Date: Fri, Jul 27, 2018 at 5:28 PM
Subject: Gerrit downtime on Aug 8, 2018
To: gluster-devel 
Cc: gluster-infra , <
automated-test...@gluster.org>


Hello,

It's been a while since we upgraded Gerrit. We plan to do a full upgrade
and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
interface, which introduces significant frontend changes. You can take a
look at how this would look on the staging site[1].

## Outage Window
0330 EDT to 0730 EDT
0730 UTC to 1130 UTC
1300 IST to 1700 IST

The actual time needed for the upgrade is about an hour, but we want to
keep a larger window open to roll back in the event of any problems during
the upgrade.

-- 
nigelb


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Coverity on build nodes

2018-08-06 Thread Nigel Babu
I've just added two nodes for Coverity. The tarball is on build.gluster.org,
but the process of setting it up on a node is pretty manual at the moment.
Ideally, I'd like an internal only server from which we can download
private binaries that we can distribute.

The tar has been extracted to /opt and we'll create builds off it for now.
But I'm highlighting this for future automation. Let's figure out what's
the best way to automate it and then file bugs for the actions.
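
For the record, the manual step on a node currently looks roughly like
this (a sketch; the tarball name and its location on build.gluster.org are
placeholders, since the binary is private and cannot be linked here):

  # copy the Coverity analysis tarball over and unpack it under /opt
  $ scp build.gluster.org:/path/to/cov-analysis-linux64.tar.gz /tmp/
  $ sudo tar -xzf /tmp/cov-analysis-linux64.tar.gz -C /opt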

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Master branch is closed

2018-08-05 Thread Nigel Babu
Hello folks,

Master branch is now closed. Only a few people have commit access now and
it's to be exclusively used to merge fixes to make master stable again.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Gerrit downtime on Aug 8, 2018

2018-07-28 Thread Nigel Babu
FYI: There is an issue with seeing diffs on staging. I've root-caused this
to a bug in our apache configuration for Gerrit. This is trickier than I
want to handle at the moment, but I'm aware of the problem and have tested
out a fix. We'll fix it more permanently in ansible on Monday. My current
fix will get overwritten by Ansible tonight :)

On Fri, Jul 27, 2018 at 5:28 PM Nigel Babu  wrote:

> Hello,
>
> It's been a while since we upgraded Gerrit. We plan to do a full upgrade
> and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
> interface, which introduces significant frontend changes. You can take a
> look at how this would look on the staging site[1].
>
> ## Outage Window
> 0330 EDT to 0730 EDT
> 0730 UTC to 1130 UTC
> 1300 IST to 1700 IST
>
> The actual time needed for the upgrade is about an hour, but we want to
> keep a larger window open to roll back in the event of any problems during
> the upgrade.
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [automated-testing] Gerrit downtime on Aug 8, 2018

2018-07-27 Thread Nigel Babu
Ah, apologies.

Staging URL: http://gerrit-stage.rht.gluster.org/

If you want to try out PolyGerrit, the new UI, click on the footer of the
page that says "Switch to new UI".

On Fri, Jul 27, 2018 at 5:46 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> The staging URL seems to be missing from the note
>
> On Fri, Jul 27, 2018 at 5:28 PM, Nigel Babu  wrote:
> > Hello,
> >
> > It's been a while since we upgraded Gerrit. We plan to do a full
> > upgrade and move to 2.15.3. Among other changes, this brings in the new
> > PolyGerrit interface, which introduces significant frontend changes.
> > You can take a look at how this would look on the staging site[1].
> >
> > ## Outage Window
> > 0330 EDT to 0730 EDT
> > 0730 UTC to 1130 UTC
> > 1300 IST to 1700 IST
> >
> > The actual time needed for the upgrade is about an hour, but we want
> > to keep a larger window open to roll back in the event of any problems
> > during the upgrade.
> >
> > --
> > nigelb
>
> --
> sankarshan mukhopadhyay
> <https://about.me/sankarshan.mukhopadhyay>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Gerrit downtime on Aug 8, 2018

2018-07-27 Thread Nigel Babu
Hello,

It's been a while since we upgraded Gerrit. We plan to do a full upgrade
and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
interface, which introduces significant frontend changes. You can take a
look at how this would look on the staging site[1].

## Outage Window
0330 EDT to 0730 EDT
0730 UTC to 1130 UTC
1300 IST to 1700 IST

The actual time needed for the upgrade is about an hour, but we want to
keep a larger window open to roll back in the event of any problems during
the upgrade.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
On Wed, Jul 25, 2018 at 6:51 PM Niels de Vos  wrote:

> We had someone working on starting/stopping Jenkins slaves in Rackspace
> on-demand. He since has left Red Hat and I do not think the infra team
> had a great interest in this either (with the move out of Rackspace).
>
> It can be deleted from my point of view.
>

FYI, stopping a cloud server does not mean we don't get charged for it. So
I don't know if it was a useful exercise to begin with.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
> So while cleaning thing up, I wonder if we can remove this one:
> https://github.com/gluster/jenkins-ssh-slaves-plugin
>
> We have just a fork, lagging from upstream and I am sure we do not use
> it.
>

Safe to delete. We're not using it for sure.


>
> The same goes for:
> https://github.com/gluster/devstack-plugins
>
> since I think openstack did change a lot, that seems like some internal
>  configuration for dev, I guess we can remove it ?
>

This one seems ahead of the original fork, but I'd say delete.


>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
I think our team structure on Github has become unruly. I prefer that we
use teams only when we can demonstrate that there is a strong need. At the
moment, the gluster-maintainers and the glusterd2 projects have teams that
have a strong need. If any other repo has a strong need for teams, please
speak up. Otherwise, I suggest we delete the teams and add the relevant
people as collaborators on the project.

It should be safe to delete the gerrit-hooks repo. These are now Github
jobs. I'm not in favor of archiving the old projects if they're going to be
hidden from someone looking for them. If they just move to the end of the
listing, it's fine to archive.

On Fri, Jun 29, 2018 at 10:26 PM Michael Scherer 
wrote:

> Le vendredi 29 juin 2018 à 14:40 +0200, Michael Scherer a écrit :
> > Hi,
> >
> > So, after the Gentoo hack, I started to look at all our teams on
> > github, and what access everybody has, etc, etc
> >
> > And I have a few issues:
> > - we have old repositories that are no longer used
> > - we have team without description
> > - we have people without 2FA who are admins of some team
> > - github make this kind of audit really difficult without scripting
> > (and the API is not stable yet for teams)
> >
> > So I would propose the following rules, and apply them in 1 or 2
> > weeks
> > time.
> >
> > For projects:
> >
> > - archive all old projects, i.e. ones that have had no commit for 2
> > years, unless people give a reason for the project to stay unarchived.
> > Being archived does not remove a project; it just hides it by default
> > and sets it read-only. It can be reverted without trouble.
> >
> > See https://help.github.com/articles/archiving-a-github-repository/
> >
> > - remove projects that never started ("vagrant" is one example; there
> > is only one readme file).
> >
> > For teams:
> > - if you are admin of a team, you have to turn on 2FA on your
> > account.
> > - if you are admin of the github org, you have to turn on 2FA.
> >
> > - if a team no longer has a purpose (for example, all its repos got
> > archived or removed), it will be removed.
> >
> > - add a description to every team that tells what kind of access it
> > gives.
> >
> >
> > This would give us a bit more clarity and security.
>
> So to get some perspective after writing a script to get the
> information, the repos I propose to archive:
>
> Older than 3 years, we have:
>
> - gmc-target
> - gmc
> - swiftkrbauth
> - devstack-plugins
> - forge
> - glupy
> - glusterfs-rackspace-regression-tester
> - jenkins-ssh-slaves-plugin
> - glusterfsiostat
>
>
> Older than 2 years, we have:
> - nagios-server-addons
> - gluster-nagios-common
> - gluster-nagios-addons
> - mod_proxy_gluster
> - gluster-tutorial
> - gerrit-hooks
> - distaf
> - libgfapi-java-io
>
> And to remove, because empty:
> - vagrant
> - bigdata
> - gluster-manila
>
>
> Once they are archived, I will take care of the code for finding teams
> to remove.
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Postmortem for Jenkins Outage on 20/07/18

2018-07-20 Thread Nigel Babu
Hello folks,

I had to take down Jenkins for some time today. The server ran out of space
and was silently ignoring Gerrit requests for new jobs. If you think one of
your jobs needed a smoke or regression run and it wasn't triggered, this is
the root cause. Please retrigger your jobs.

## Summary of Impact
Jenkins jobs were intermittently not triggered over the last couple of
days. At the moment, we do not have numbers on how many developers were
affected by this. The impact was mitigated slightly every day by the
rotation rules we have in place, so issues showed up mostly around evening
IST when we retrigger our regular nightly jobs.

## Timeline of Events.
July 19 evening: I had noticed since the previous day that occasionally
Jenkins would not trigger a job for a push. This was on the build-jobs
repo. I chalked it up to a signal getting lost in the noise and decided to
debug later. I could trigger it manually, so I put it down as a thing to do
in the morning. The next morning, I found that jobs were getting triggered
as they should and could not notice anything untoward.

July 20 6:41 pm: Kotresh pinged me asking if there was a problem. I could
see the problem I noticed yesterday in his job. This time a manual trigger
did not work. Around the same time Raghavendra Gowdappa also hit the same
problem. I logged into the server to notice that the Jenkins partition was
out of space.

July 20 7:40 pm: Jenkins is back online completely. A retrigger of the two
failing jobs have been successful.

## Root Cause
* Out of disk space on the Jenkins partition on build.gluster.org
* The bugzilla-post job did not delete old runs and we had about 7000 runs
in there consuming about 20G of space.
* The clang-scan job consumes about 1G per run and we were storing about 30
days' worth of archives.

## Resolution
* All centos6-regression jobs are now deleted. We moved over to
centos7-regression a while ago.
* We now only store 7 days of archives for bugzilla-post and clang-scan jobs

## Future Recommendation
* Our monitoring did not alert us about the disk filling up on the Jenkins
node. Ideally, we should have gotten a warning when we were at least 90%
full so we could plan for additional capacity or look for mistakes in
patterns.
* All jobs need to have a property that discards old runs, with a maximum
of 90 days being kept in case it's absolutely needed. This is currently not
enforced by CI but we plan to enforce it in the future.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Gerrit and postgresql replica

2018-06-30 Thread Nigel Babu
Hello,

I think the various pieces around infra have stabilized enough for us to
think about this. I suggest that we think about having a Gerrit replica in
the cloud (whichever clouds the CI consumes). This gives us a fall back
option in case the cage has problems. It also gives us a good way to reduce
the CI related load on the main Gerrit server. In the near future, when we
run distributed testing, we're going to clone 10x as much as we do now.
Right now we clone over git to take the load away from Gerrit, but when we
have a replica, I vote we clone over HTTP(s).

I would also recommend an offsite PostgreSQL replica that will let us be
somewhat fault tolerant. In the event that cage has a multi-hour
unexplained outage, we'd be able to bring back essential services.

This is a suggestion. We'll need to estimate the cost of the work involved
plus the cost of operating both these hot standbys.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Fedora builds and rawhide builds

2018-06-19 Thread Nigel Babu
Hello,

We ran into a problem where builds for F28 and above will not work in
CentOS 7 chroots. We caught this when F28 was rawhide but deemed it not yet
important enough to fix; however, recent developments have forced us to
make the switch. Our Fedora builds will also switch to using F28.

We have 10 new builders builder{40..49}.int.rht.gluster.org, all of which
run F28. These will be currently used for Fedora builds (they build with
libtirpc and rpcgen) and for the upcoming clang-format jobs.

Please let us know if you notice anything wrong with the voting patterns
for smoke jobs.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] 'Clone with commit-msg hook' produces wrong scp command

2018-06-18 Thread Nigel Babu
Hi Yaniv,

This was because we forward port 22 to port 29418.

I just changed the sshd.advertisedAddress to say review.gluster.org:22.
That did the trick. Thanks for bringing this to our attention.
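
For anyone curious, the change boils down to a single setting in Gerrit's
config file. A sketch, assuming the default site layout (Gerrit needs a
restart or reload afterwards for it to take effect):

  $ git config -f review_site/etc/gerrit.config \
      sshd.advertisedAddress "review.gluster.org:22"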

On Mon, Jun 18, 2018 at 9:01 PM, Yaniv Kaul  wrote:

> When I choose 'clone with a commit-msg hook' in Gerrit, I get the
> following scp command:
> git clone ssh://myk...@review.gluster.org/glusterfs-specs && scp -p *-P
> 29418* myk...@review.gluster.org:hooks/commit-msg
> glusterfs-specs/.git/hooks/
>
> See the part in bold in the scp command - it points to use port 29418.
> This is incorrect in review.gluster.org.
> The standard port does work.
>
> TIA,
> Y.
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Reminder OUTAGE Today 0800 EDT / 1200 UTC / 1730 IST

2018-05-14 Thread Nigel Babu
Hello,

This is a reminder that we have an outage today during the community cage
outage window. The switches and routers will be getting updated and
rebooted. This will cause an outage for a short period of time.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Fwd: Planned Network Outage in Community Cage on May 15

2018-05-11 Thread Nigel Babu
Hello folks,

There is a 15-minute cage outage on Tuesday, 15th May. Jenkins and Gerrit
will be affected by this outage.

On Tuesday May 15th, there will be a brief (~15 minutes) network
> outage in the Community Cage to allow for software upgrades on our
> network equipment. The outage will occur during our regular 12:00 to
> 16:00 UTC change window.
>
> All community cage tenants, both with physical hardware and with
> virtual machines or services hosted from the cage (see
> https://osci.io/infra_and_services/) will be impacted.
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Unplanned Jenkins restart

2018-04-16 Thread Nigel Babu
Hello folks,

I've just restarted Jenkins for a security update to a plugin. There was
one running centos-regression job that I had to cancel.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins upgrade today

2018-04-10 Thread Nigel Babu
Hello folks,

There's a Jenkins security fix scheduled to be released today. This will
most likely happen in the morning EDT; the Jenkins team has not specified a
time. When we're ready for the upgrade, I'll cancel all running jobs and
re-trigger them at the end of the upgrade. The downtime should be less than
15 mins.

Please bear with us as we continue to ensure that build.gluster.org has the
latest security fixes.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins restart on Tuesday (27)

2018-03-21 Thread Nigel Babu
Hello folks,

On Tuesday morning IST, I'll be upgrading and restarting build.gluster.org
to pick up an upcoming Jenkins plugin security fix.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Announcing Softserve- serve yourself a VM

2018-03-20 Thread Nigel Babu
Please file an issue for this:
https://github.com/gluster/softserve/issues/new

On Tue, Mar 20, 2018 at 1:57 PM, Sanju Rakonde <srako...@redhat.com> wrote:

> Hi Nigel,
>
> I have a suggestion here. It would be good if we had an option to request
> an extension of the VM duration, with the option automatically becoming
> available after 3 hours of VM usage. If somebody is using the VM after 3
> hours and they feel they need it for 2 more hours, they can request to
> extend the duration by 1 more hour. It will save engineering time, since
> if a machine has expired, one has to configure the machine and all the
> other stuff from the beginning.
>
> Thanks,
> Sanju
>
> On Tue, Mar 13, 2018 at 12:37 PM, Nigel Babu <nig...@redhat.com> wrote:
>
>>
>> We’ve enabled certain limits for this application:
>>>>
>>>>1.
>>>>
>>>>Maximum allowance of 5 VM at a time across all the users. User have
>>>>to wait until a slot is available for them after 5 machines allocation.
>>>>2.
>>>>
>>>>User will get the requesting machines maximum upto 4 hours.
>>>>
>>>>
>>> IMHO, a max cap of 4 hours is not sufficient. Most of the time, the
>>> reason for loaning a machine is basically to debug a race where we can't
>>> reproduce the failure locally, and from what I have seen, debugging such
>>> tests might take more than 4 hours. Imagine you had done some tweaking to
>>> the code and you're so close to understanding the problem and then the
>>> machine expires; it's definitely not a happy feeling. What are the
>>> operational challenges if we make it at least 8 hours or max a day?
>>>
>>
>> The 4h cap was kept so that multiple people could have a chance to debug
>> their test failures on the same day. Pushing the cap to 8h means that if
>> you don't have a machine to loan when you start work, one will not be
>> available until the next day. At this point, we'll not be increasing the
>> timeout. So far, we've had one person actually hit this. I'd like to see
>> more data points before we make an application-level change.
>>
>> --
>> nigelb
>>
>> ___
>> Gluster-devel mailing list
>> gluster-de...@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Distributed Testing and Memory issues

2018-03-17 Thread Nigel Babu
Hey Karthik,

Deepshikha has been working on testing the distributed test framework that
you contributed (thank you!). Instead of writing our own code to chunk the
tests, we've decided to just consume what you've written so we can work on
making it run both at FB and upstream.

We're running into a MemoryError exception from the threads. Do you know
the best way to debug this? Alternatively, could you let us know how much
memory your machines have? That'll help us figure this out sooner upstream.

PS: This email is CC'd to gluster-infra and is archived publicly.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Fwd: Query regarding coverity scan

2018-03-15 Thread Nigel Babu
Hi Kaleb,

Do you know what's going wrong with Coverity jobs?

-- Forwarded message --
From: Varsha Rao <va...@redhat.com>
Date: Thu, Mar 15, 2018 at 10:15 AM
Subject: Query regarding coverity scan
To: Nigel Babu <nb...@redhat.com>


Hello Nigel,

I have been observing the coverity scans generated for the past few days.
I did not find the html page with the list of all errors as it used to be.

https://download.gluster.org/pub/gluster/glusterfs/static-
analysis/master/glusterfs-coverity/2018-03-14-f32f85c4/html/

https://download.gluster.org/pub/gluster/glusterfs/static-
analysis/master/glusterfs-coverity/2018-03-12-a96c7e74/html/

Thanks,
Varsha



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] gluster-ant is now admin on synced repos

2018-03-15 Thread Nigel Babu
Hello,

If a repo is synced from Gerrit to Github, gluster-ant is now an admin on
it. This is so that when issues are closed via commit message, they are
closed by the right user (the bot), rather than by the infra person who set
that repo up.

As always, please file a bug if you notice any problems.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Please help test Gerrit 2.14

2018-03-04 Thread Nigel Babu
Hello,

It's that time again. We need to move up a Gerrit release. Staging has now
been upgraded to the latest version. Please help test it and give us
feedback on any issues you notice: https://gerrit-stage.rht.gluster.org/

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Continuous tests failure on Fedora RPM builds

2018-03-02 Thread Nigel Babu
This is now fixed. Shyam found the root cause. After a mock upgrade, mock
would wait for user confirmation because DNF wasn't installed on the
system. Given this was a CentOS machine, DNF wasn't readily available. I
set the config option dnf_warning=False and that fixed the failures. All
previously failed jobs were retried and should now be green. I also took
the opportunity to "upgrade" the Fedora buildroot to F27.
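
For reference, the change amounts to a single line in the mock
configuration. A sketch, assuming the site-wide defaults file is used (the
builders may carry it in a per-chroot config instead):

  $ echo "config_opts['dnf_warning'] = False" \
      | sudo tee -a /etc/mock/site-defaults.cfg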

On Wed, Feb 28, 2018 at 8:00 PM, Amar Tumballi  wrote:

> Looks like the tests here are continuously failing:
> https://build.gluster.org/job/devrpm-fedora/
>
> It would be great if someone takes a look at it.
>
> --
> Amar Tumballi (amarts)
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Infra machines update

2018-02-19 Thread Nigel Babu
Hello folks,

We're all out of Centos 6 nodes from today. I've just deleted the last of
them. We now run exclusively on Centos 7 nodes.

We've not received any negative feedback about our plans to move off
NetBSD, so I've disabled and removed all the NetBSD jobs and nodes as well.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Nigel Babu
On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran <nbala...@redhat.com>
wrote:

>
>
> On 19 February 2018 at 13:12, Atin Mukherjee <amukh...@redhat.com> wrote:
>
>>
>>
>> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu <nig...@redhat.com> wrote:
>>
>>> Hello,
>>>
>>> As you all most likely know, we store the tarball of the binaries and
>>> core if there's a core during regression. Occasionally, we've introduced a
>>> bug in Gluster and this tar can take up a lot of space. This has happened
>>> recently with brick multiplex tests. The build-install tar takes up 25G,
>>> causing the machine to run out of space and continuously fail.
>>>
>>
>> AFAIK, we don't have a .t file in the upstream regression suites where
>> hundreds of volumes are created. With that scale and brick multiplexing
>> enabled, I can understand the core will be quite heavily loaded and may
>> consume up to this crazy amount of space. FWIW, can we first try to
>> figure out which test was causing this crash and see if running a gcore
>> after certain steps in the test leaves us with a similar size of core
>> file? IOW, have we actually seen such a huge core file generated
>> earlier? If not, what changed to make us start seeing this is something
>> to be investigated.
>>
>
> We also need to check if this is only the core file that is causing the
> increase in size or whether there is something else that is taking up a lot
> of space.
>
>
I don't disagree. However, there are two problems here. In the few cases
where we've had such a large build-install tarball:

1. The tar doesn't actually finish being created. So it's not even
something that can be untar'd. It would just error out.
2. All subsequent jobs on this node fail.

The only remaining option is to watch out for situations when the tar file
doesn't finish creation and highlight it. Now that we have moved to chunked
regressions, the nodes do not get re-used, so 2 isn't a problem.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins Issues this weekend and how we're solving them

2018-02-18 Thread Nigel Babu
Hello,

As you all most likely know, we store the tarball of the binaries and core
if there's a core during regression. Occasionally, we've introduced a bug
in Gluster and this tar can take up a lot of space. This has happened
recently with brick multiplex tests. The build-install tar takes up 25G,
causing the machine to run out of space and continuously fail.

I've made some changes this morning. Right after we create the tarball,
we'll delete all files in /archive that are greater than 1G. Please be
aware that this means all large files including the newly created tarball
will be deleted. You will have to work with the traceback on the Jenkins
job.
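
For the curious, the cleanup boils down to a single find invocation,
roughly like this (the path and threshold are as described above; the
exact invocation in the job may differ):

# drop anything in /archive larger than 1G right after the tarball is made
find /archive -type f -size +1G -delete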




-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] build.gluster.org in shutdown mode

2018-02-14 Thread Nigel Babu
This upgrade is now complete and we're now running the latest version of
Jenkins.

On Thu, Feb 15, 2018 at 9:53 AM, Nigel Babu <nig...@redhat.com> wrote:

> Hello,
>
> I've just placed Jenkins in shutdown mode. No new jobs will be started for
> about an hour from now. I intend to upgrade Jenkins to pull in the latest
> security fixes.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] build.gluster.org in shutdown mode

2018-02-14 Thread Nigel Babu
Hello,

I've just placed Jenkins in shutdown mode. No new jobs will be started for
about an hour from now. I intend to upgrade Jenkins to pull in the latest
security fixes.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Planned Outage: supercolony.gluster.org on 2018-02-21

2018-01-31 Thread Nigel Babu
Hello folks,

We're going to be resizing the supercolony.gluster.org on our cloud
provider. This will definitely lead to a small outage for 5 mins. In the
event that something goes wrong in this process, we're taking a 2-hour
window for this outage.

Date: Feb 21
Server: supercolony.gluster.org
Time: 1000 to 1200 UTC / 1100 to 1300 CET / 1530 to 1730 IST
Services affected:
* gluster.org redirect
* lists.gluster.org (UI and mail server)
* planet.gluster.org

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Gerrit: Maintainers can now edit review topic

2018-01-30 Thread Nigel Babu
Hello,

Anoop pointed out that he couldn't edit the topic on a patch submitted by
an external contributor. To improve our drive-by contribution, I've enabled
Edit Topic permission for maintainers. This means you can fix topic
problems when an external contributor submits a patch.

Let me know if there are any other problems like this which need to be
fixed.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Infra-related Regression Failures and What We're Doing

2018-01-22 Thread Nigel Babu
Update: All the nodes that had problems with geo-rep are now fixed. Waiting
on the patch to be merged before we switch over to Centos 7. If things go
well, we'll replace nodes one by one as soon as we have one green on Centos
7.

On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nig...@redhat.com> wrote:

> Hello folks,
>
> As you may have noticed, we've had a lot of centos6-regression failures
> lately. The geo-replication failures are the new ones which particularly
> concern me. These failures have nothing to do with the test. The tests are
> exposing a problem in our infrastructure that we've carried around for a
> long time. Our machines are not clean machines that we automated. We setup
> automation on machines that were already created. At some point, we loaned
> machines for debugging. During this time, developers have inadvertently
> done 'make install' on the system to install onto system paths rather than
> into /build/install. This is what is causing the geo-replication tests to
> fail. I've tried cleaning the machines up several times with little to no
> success.
>
> Last week, we decided to take an aggressive path to fix this problem. We
> planned to replace all our problematic nodes with new Centos 7 nodes. This
> exposed more problems. We expected a specific type of machine from
> Rackspace. These are no longer offered. Thus, our automation fails on some
> steps. I've spent this weekend tweaking our automation so that it works
> on the new Rackspace machines and I'm down to just one test failure[1]. I
> have a patch up to fix this failure[2]. As soon as that patch is merged,
> we can push forward with Centos7 nodes. In 4.0, we're dropping support for
> Centos 6, so this decision makes more sense to do sooner than later.
>
> We'll not be lending machines anymore from production. We'll be creating
> new nodes which are a snapshots of an existing production node. This
> machine will be destroyed after use. This helps prevent this particular
> problem in the future. This also means that our machine capacity at all
> times is at 100 with very minimal wastage.
>
> [1]: https://build.gluster.org/job/cage-test/184/consoleText
> [2]: https://review.gluster.org/#/c/19262/
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Infra-related Regression Failures and What We're Doing

2018-01-21 Thread Nigel Babu
Hello folks,

As you may have noticed, we've had a lot of centos6-regression failures
lately. The geo-replication failures are the new ones which particularly
concern me. These failures have nothing to do with the test. The tests are
exposing a problem in our infrastructure that we've carried around for a
long time. Our machines are not clean machines that we automated. We setup
automation on machines that were already created. At some point, we loaned
machines for debugging. During this time, developers have inadvertently
done 'make install' on the system to install onto system paths rather than
into /build/install. This is what is causing the geo-replication tests to
fail. I've tried cleaning the machines up several times with little to no
success.

Last week, we decided to take an aggressive path to fix this problem. We
planned to replace all our problematic nodes with new Centos 7 nodes. This
exposed more problems. We expected a specific type of machine from
Rackspace. These are no longer offered. Thus, our automation fails on some
steps. I've spent this weekend tweaking our automation so that it works on
the new Rackspace machines and I'm down to just one test failure[1]. I have
a patch up to fix this failure[2]. As soon as that patch is merged, we can
push forward with Centos7 nodes. In 4.0, we're dropping support for Centos
6, so this decision makes more sense to do sooner than later.

We'll not be lending machines anymore from production. We'll be creating
new nodes which are a snapshots of an existing production node. This
machine will be destroyed after use. This helps prevent this particular
problem in the future. This also means that our machine capacity at all
times is at 100 with very minimal wastage.

[1]: https://build.gluster.org/job/cage-test/184/consoleText
[2]: https://review.gluster.org/#/c/19262/

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Please file a bug if you take a machine offline

2018-01-10 Thread Nigel Babu
Hello folks,

If you take a machine offline, please file a bug so that the machine can be
debugged and return to the pool.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Shutting down cloud machines

2018-01-07 Thread Nigel Babu
Hello folks,

In an effort to cut down machines that we don't use, I plan to shut down
the following machines in the following days. Please let me know if for
some reason I should not be shutting them down:

salt-master.gluster.org
webbuilder.gluster.org
nbslave70.cloud.gluster.org
nbslave71.cloud.gluster.org
nbslave72.cloud.gluster.org
nbslave74.cloud.gluster.org
nbslave75.cloud.gluster.org
nbslave77.cloud.gluster.org
test_gerrit_stats
debian_test_build

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] r.g.o returns a 503 error

2017-12-25 Thread Nigel Babu
Unplanned. I fixed this yesterday. Going to apply a more permanent fix
today.

The server restarted and we haven't implemented a way to start the service
when the machine restarts. We're testing a systemd config file in staging
and I'll look at applying that to production today.
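
For reference, what we're testing in staging is roughly the following (the
install path and service user here are assumptions; the final unit file
may well differ):

# write a unit file so Gerrit starts on boot, then enable it
cat > /etc/systemd/system/gerrit.service <<'EOF'
[Unit]
Description=Gerrit Code Review
After=network.target

[Service]
Type=forking
User=gerrit
ExecStart=/opt/gerrit/review_site/bin/gerrit.sh start
ExecStop=/opt/gerrit/review_site/bin/gerrit.sh stop

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable gerrit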

On Mon, Dec 25, 2017 at 10:07 AM, Ravishankar N 
wrote:

> Season's Greetings. :-)
> review.gluster.org is seems to be down since yesterday. Is this a planned
> outage?
>
> Yours need-to-look-at-review-comments-on-my-patch-ingly,
> Ravi
>
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Moving Regressions to Centos 7

2017-12-20 Thread Nigel Babu
Hello folks,

We've been using Centos 6 for our regressions for a long time. I believe
it's time that we moved to Centos 7. It's causing us minor issues. For
example, tests run fine on the regression boxes but don't work on local
machines or vice-versa. Moving up gives us the ability to use newer
versions of tools as well.

If nobody has any disagreement, the plan is going to look like this:
* Bring up 10 Rackspace Centos 7 nodes.
* Test chunked regression runs on Rackspace Centos 7 nodes for one week.
* If all works well, kill off all the old nodes and switch all normal
regressions to Rackspace Centos 7 nodes.

I expect this process to be complete right around 2nd week of Jan. Please
let me know if there are concerns.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Changes in handling logs from (centos) regressions and smoke

2017-11-20 Thread Nigel Babu
Hello folks,

We're making some changes in how we handle logs from Centos regression and
smoke tests. Instead of having them available via HTTP access to the node
itself, it will be available via the Jenkins job as artifacts.

For example:
Smoke job: https://build.gluster.org/job/smoke/38523/console
Logs: https://build.gluster.org/job/smoke/38523/artifact/ (link available
from the main page)
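
As far as I know, Jenkins also lets you grab every artifact of a build in
one go as a zip, for example:

wget "https://build.gluster.org/job/smoke/38523/artifact/*zip*/archive.zip"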

We clear out regression logs every 30 days, so if you can see a regression
on build.gluster.org, logs for that should be available. This reduces the
need for space or HTTP access on our nodes and for separate deletion
process.

We also archive builds and cores. This is still available the old-fashioned
way, however, I intend to change that in the next few weeks to centralize
it to a file server.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Unplanned Jenkins restart

2017-11-19 Thread Nigel Babu
I noticed that Jenkins wasn't loading up this morning. Further debugging
showed a java heap size problem. I tried to debug it, but eventually just
restarted Jenkins. This means any running job or any job triggered was
stopped. Please re-trigger your jobs.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] gluster-zeroconf has been moved to the Gluster GitHub organisation

2017-11-16 Thread Nigel Babu
Please create a team and add non-org admins as team maintainers.

On Thu, Nov 16, 2017 at 7:27 PM, Niels de Vos  wrote:

> Hi all,
>
> I have moved the gluster-zeroconf repository from my personal github
> account to the Gluster organisation one. Dustin Black and me are the
> current two contributors/maintainers, we'll be adding more if people
> express interest (Ramky?). There is no "GitHub Team" for the admins of
> this repository, I do not know if that is something we're doing for all
> of the projects we're hosting?
>
> For this repository we will be using GitHub Pull-Requests for reviewing
> and merging changes. The number of patches will likely be small and the
> Gerrit review workflow adds (too much) overhead.
>
> Niels
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Unplanned Jenkins restart this morning

2017-11-08 Thread Nigel Babu
Hello folks,

I had to do a quick Jenkins upgrade and restart this morning for an urgent
security fix. A few of our periodic jobs were cancelled, I'll re-trigger
them now.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Unplanned Gerrit Outage yesterday

2017-11-02 Thread Nigel Babu
Hello folks,

Yesterday, we had an unplanned Gerrit outage. We have now determined that
the machine rebooted for some reason. Michael is continuing to debug what
led to this issue. At this point, Gerrit does not start automatically when
the VM restarts.

We are currently testing a systemd unit file for Gerrit in staging. Once
that's in place, we can ensure that we start Gerrit automatically when we
restart the server.

Timeline of events (in CET):
16:29 - I receive an alert that Gerrit is down. This goes ignored because
we're still working on Jenkins.

18:25 - I notice the alerts as we're packing up for the day and start
Gerrit.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Postmortem of emails from gerrit-stage.rht.gluster.org

2017-11-01 Thread Nigel Babu
Hello folks,

Some of you may have gotten a large number of emails that your Gerrit
filters did not catch. This is because we added a config to Gerrit Stage
yesterday to start closing old reviews. This is a conversation we've had
for a while and it needed testing. In anticipation of a large number of
emails, I had shut down the postfix server. However, I forgot that our
automation would turn postfix back on. This means some of you may have
gotten spammed today. You can trash all those emails.

We've intentionally broken main.cf so that Ansible does not restart postfix
again. So there should be no more spam for now.

This has also taught us that doing this in production is not advisable. The
staging server is currently rate-limited by major email providers due to
the number of emails we sent out. We can separately discuss how to close
old reviews.

Apologies again for the spam.


Thanks,
Nigel and Michael
Gluster Infra Team



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins OS Upgrade Complete

2017-11-01 Thread Nigel Babu
Hello folks,

The downtime window is now complete and here's a report on what's happened
today.

## Jenkins
We've moved build.gluster.org to a new server that runs Centos 7. This
server is managed in Ansible, though we do not yet manage Jenkins in
ansible. The old server is now called jenkins.rht.gluster.org. If you want
to copy something off your home directory, now is the time. We will be
shutting off this server by December. We've confirmed that jobs can be
triggered and report back to Gerrit.

We spent a large amount of time waiting for rsync to copy all the files
over. We had about 50G of logs, RPMs, and associated files in the jobs
folder.

## bits.gluster.org
We host an archive of all release builds on bits.gluster.org. This used to
be on the build.gluster.org server. In fact, our release job depended on
this. We've created a new server called http.int.rht.gluster.org (internal)
to host static content like this. During this migration, we copied all the
files to the new server and gave it about 5GB of space in total. If you know
of files we can clear out from this server, please get in touch with us via a bug[1].

If you're a release manager, please pay attention to the following
information. The release job on Jenkins is now deprecated. Please use the
release-new job instead. When you use the release-new job, it will not
place the tarball in bits.gluster.org (for now). This will be fixed this
week. We're deferring this right now due to the late hour and the long day
Michael and I have both had with this migration.

If you face any issue with jenkins, please do not hesitate to file a bug
and get in touch with us.

Thanks,
Nigel and Michael
Gluster Infra Team

[1]:
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS&component=project-infrastructure


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins outage on 1st Nov

2017-10-30 Thread Nigel Babu
Hello folks,

We'll have a Jenkins outage on 1st Nov. This outage will be
from 0900 to 1700 UTC.

EDT: 0500 to 1300
CET: 1000 to 1800
IST: 1430 to 2230

Given that Michael and I are co-located and this is a holiday in India,
it's a good opportunity for us to fix a lot of security issues with the old
Jenkins server and move it onto a fresh and clean machine. At this point,
we cannot estimate how much time we'll take.

Ideally, here's what we're looking to do:

* Setup a new VM for Jenkins in CentOS 7.
* Install Jenkins and dependencies.
* Move over the /var/lib/jenkins folder from current install to new one.
* Switch DNS for build.gluster.org to the new server.
* Move over the bits.gluster.org data to a http.int.rht.gluster.org and
setup a proxy redirect.
* Change the release job to push from Jenkins job to
http.int.rht.gluster.org.
* Switch DNS for bits.gluster.org.

Depending on complications that may arise during this time, we may or may
not do the bits.gluster.org migration along with this.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Quarterly Infra Updates

2017-10-30 Thread Nigel Babu
It's been a while since I posted an update. We're shifting to a quarterly
update system from this time onwards.

Here's what's kept us busy last quarter:
* We've moved our long-term planning from "bugs" to a (currently private)
Trello board. This helps us plan for long-term projects and scheduling time
for fixing infrastructure debt.
* Static analysis is almost entirely on Jenkins now except for Coverity.
Thanks to Deepshika for her hard work in getting this to production. We
plan to move this to Jenkins as well, with a view to using coverity.com
rather than the Red Hat internal instance. This gives us the ability to see
new issues and their severity with better clarity.
* We commit our Jenkins jobs to the build-jobs repo on Gerrit. Now when a
review is merged, it automatically runs `jenkins-jobs update`. This means
the latest version of the job configuration is running on Jenkins all the
time (a sketch of the command is below, after this list).
* We now collect logs from an aborted regression job.
* Jenkins upgraded to the latest version (2.73.2) to prevent security
issues.
* We can now trigger jobs on ci.centos.org from build.gluster.org. This is
useful to make the pipeline pieces fit together.
* On a change which only modifies tests, the regression run will only run
those specific tests. Especially useful when tests are deactivated.
* All the smoke jobs now run on internal machines, using fewer public IPs.
* We're in the process of setting up ci-logs.gluster.org. Once this is
set up, individual test servers will no longer have a webserver.
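
For the record, the `jenkins-jobs update` run mentioned above boils down to
something like this (the config path and jobs directory are illustrative;
the real invocation lives in the build-jobs repo):

# push the YAML job definitions from the build-jobs checkout to Jenkins
jenkins-jobs --conf /etc/jenkins_jobs/jenkins_jobs.ini update jobs/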

Plans for this quarter:
* Gerrit upgrade to 2.13.9 [DONE!]
* Jenkins OS upgrade to CentOS 7
* Get statedumps from aborted regression runs
* Build Debian packages via Jenkins
* Tweak Gerrit permissions
* Finish the master pipeline job
* Create the release pipeline job
* Get Glusto debugging setups.
* Get CentOS regressions split into 10 chunks to reduce time.

Michael and I will be working together from the Brno office this week. We
want to get done with the Gerrit upgrade (DONE, yay!) and the Jenkins OS
upgrade. The Jenkins upgrade will happen on the 1st of Nov as it's a
holiday in India. We'll be taking a full 8-hour window to finish everything.
The Jenkins OS upgrade will remove SSH access for everyone. If you
have files in your home directory on the Jenkins server that you'd like to
preserve, please do so now.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Upgrading Gerrit at 1400 UTC

2017-10-30 Thread Nigel Babu
Hello,

This upgrade is now complete. We've tested a push and merge. Please let us
know if you run into any troubles.

On Mon, Oct 30, 2017 at 10:09 AM, Nigel Babu <nig...@redhat.com> wrote:

> Hello,
>
> We've been running Gerrit staging on the latest version of Gerrit. We've
> planned several times to do an upgrade, but the timing hasn't worked out.
> Given that most people are traveling back from Gluster Summit, I'll be
> working on doing an upgrade today.
>
> The downtime window will be from 1400 to 1500 UTC. In other relevant
> timezones:
> EDT: 1000 to 1100
> CET: 1500 to 1600
> IST: 1930 to 2030
>
> The outage should not take more than 10 minutes in reality, but we're
> taking a larger window for rollback in the event that the deployment is not
> successful.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Upgrading Gerrit at 1400 UTC

2017-10-30 Thread Nigel Babu
Hello,

We've been running Gerrit staging on the latest version of Gerrit. We've
planned several times to do an upgrade, but the timing hasn't worked out.
Given that most people are traveling back from Gluster Summit, I'll be
working on doing an upgrade today.

The downtime window will be from 1400 to 1500 UTC. In other relevant
timezones:
EDT: 1000 to 1100
CET: 1500 to 1600
IST: 1930 to 2030

The outage should not take more than 10 minutes in reality, but we're
taking a larger window for rollback in the event that the deployment is not
successful.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Jenkins Nodes changes

2017-10-11 Thread Nigel Babu
That's among the few plans we have once we get this up and running.

On Wed, Oct 11, 2017 at 11:53 AM, Amar Tumballi <atumb...@redhat.com> wrote:

> Can we keep nightly builds of different branches in this new server? Would
> be good to keep just the last 7 days of builds.
>
> Regards,
> Amar
>
> On 11-Oct-2017 10:14 AM, "Nigel Babu" <nig...@redhat.com> wrote:
>
>> Hello folks,
>>
>> I've just gotten back after a week away. I've made a couple of changes to
>> Jenkins nodes:
>>
>> * All smoke jobs now run on internal nodes.
>> * All Rackspace nodes are back in action. We had a few issues with some
>> nodes, all of them have been looked into and fixed.
>>
>> In the near future, we plan to have a ci-logs.gluster.org domain where
>> the smoke logs and regression logs will be available instead of having a
>> web server on individual nodes. Deepshika is actively working on making the
>> changes required to get this done.
>>
>> --
>> nigelb
>>
>> ___
>> Gluster-infra mailing list
>> Gluster-infra@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-infra
>>
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] review.gluster.org outage today

2017-09-25 Thread Nigel Babu
Hello folks,

We had a brief outage today of review.gluster.org.

# Timeline of Events (Times in IST)
1311: I received a notification from monitoring that review.gluster.org was
throwing 503 errors. I logged into the machine and noticed that Gerrit
wasn't running at all. I started the service and it came back up instantly.

1315: Review.gluster.org back online.

# Root Cause
At this point, we have no idea. We'll investigate this during the week.

# Next Actions
* Speed up getting Gerrit on our internal monitoring so we can monitor
CPU/Memory consumption.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] review.gluster.org outage for the next 5 mins

2017-09-19 Thread Nigel Babu
Hello folks,

We need to do a restart of Gerrit thanks to a Java security update. We'll
be back in ~5 mins or so.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Postmortem for yesterday's outage

2017-09-18 Thread Nigel Babu
Hello folks,

We had a brief outage yesterday that misc and I were working on fixing. We're
committed to doing a formal post-mortem of outages whether it affects everyone
or not, as a habit. Here's a post-mortem of yesterday's event.

## Affected Servers
* salt-master.rax.gluster.org
* syslog01.rax.gluster.org

## Total Duration
~4 hours

## What Happened
A few Rackspace servers depend on DHCP (default Rackspace setup). Due to the
Centos 7.4 upgrade, we rebooted some servers, since the kernel and other packages were
upgraded. At this point, we're unsure if this is a DHCP bug, an upgrade gone
wrong, or if Rackspace DHCP servers are at fault. We will be looking into this
in the coming days.

Michael had issues with the Rackspace console, so Nigel stepped in to help with
the outage.

Once we accessed the machine via the Emergency Console, we spent some time
trying to get a DHCP lease. When that didn't work, we started setting up a static
IP and gateway. This took a few tries since the Rackspace documentation for
doing this was wrong. There's also a slight difference between "ip" and
"ifconfig", which created further confusion.

This is what we eventually did on one of the servers:
ip address add 162.209.109.18/24 dev eth0
route add default gw 162.209.109.1
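
(For completeness: the two commands above only last until the next reboot.
Making the assignment permanent on Centos would need something along the
lines of the sketch below; this is illustrative, not what we deployed.)

cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=162.209.109.18
PREFIX=24
GATEWAY=162.209.109.1
EOF
systemctl restart network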

This incident did not affect any of our critical services. Gerrit, Jenkins, and
download.gluster.org remained unaffected during this period.

We were limited in our ability to roll out any changes via ansible to these
servers during this ~4h window. We have a second server in progress for
deploying infrastructure but the setup is not ready yet. Manual roll-out from a
sysadmin's laptop was always possible in case of trouble.

## Timeline of Events
Note: All times in (CEDT)
* 09:00 am: Nigel and Michael are planning a new http server inside the cage for
   logs, packages, and Coverity scans.
* 10:00 am: Michael starts the ansible process to install the new server
* 12:10 pm: The topic of the Centos 7.4 upgrade comes up during discussion and Michael
does an upgrade and reboot on salt-master.rax.gluster.org.
* 12:15 pm: Michael notices that the salt-master server is not coming back.
Nigel confirms.
* 12:15 pm: Nigel logs into Rackspace and does a hard restart on the
salt-master machine. No luck.
* 12:34 pm: Nigel logs a ticket with Rackspace about the server outage.
* 12:44 pm: Nigel starts chat conversation with Rackspace support for
escalation. Customer support engineer informs us that the server is
up and can be accessed via Emergency Console.
* 12:57 pm: Nigel gains access via the Emergency Console. Michael's initial RCA of
the issue is a network problem caused by the upgrade. Nigel confirms
the RCA by verifying that eth0 does not have a public IP. Nigel
tries to get the IP address to stick with the right gateway.
* 12:35 pm: Nigel manages to get salt-master online briefly.
* 13:34 pm: Nigel brings the salt-master back online.
* 13:40 pm: Michael tries to upgrade the syslog server and reboots it; it does
not come up either.
* 13:55 pm: Nigel brings syslog back online as well.

## Pending Actions
* Michael to figure out if there is a bug in the new DHCP daemon, or if things
  changed Rackspace side.
* Michael to finish move of salt-master into the cage
  (ant-queen.int.rht.gluster.org) to prevent further issues.
* Nigel to send a note to Rackspace support to fix their documentation.

--
Nigel and Michael
Gluster Infra Team
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] Jenkins Restart at 1830 IST

2017-08-29 Thread Nigel Babu
The restart and upgrade is complete. Any jobs that were triggered during
the quiet period are starting up now.

On Tue, Aug 29, 2017 at 4:05 PM, Nigel Babu <nig...@redhat.com> wrote:

> Hello folks,
>
> We need to fix a networking problem on Jenkins, upgrade, and apply a few
> security fixes to Jenkins. We'll be going into quiet mode now and any jobs
> still running at 18:30 will be canceled and re-triggered after the restart.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Migration from gerrit bugzilla hook to a jenkins job

2017-08-17 Thread Nigel Babu
On Thu, Aug 17, 2017 at 5:36 PM, Mohammed Rafi K C <rkavu...@redhat.com>
wrote:

>
>
> On 08/17/2017 05:07 PM, Nigel Babu wrote:
>
> This change is taking the first step towards implementing those ideas. One
> of the major blockers to implementing them was that it was difficult to
> grant easy access to change the hook. Granting production access to Gerrit
> is next to impossible unless you really know what you're doing.
>
> We'll not increase the scope *right now* to include automating the entire
> bug workflow.
>
>
> Cool. If I understand correctly , once we have the jenkins job in place,
> we can resume the work done by Manikandan . is that right ?
>

Yes. The Jenkins job is in place precisely to facilitate the work that
Manikandan and Nandaja started, in a much nicer way.


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Migration from gerrit bugzilla hook to a jenkins job

2017-08-17 Thread Nigel Babu
This change is taking the first step towards implementing those ideas. One
of the major blockers to implementing them was that it was difficult to
grant easy access to change the hook. Granting production access to Gerrit
is next to impossible unless you really know what you're doing.

We'll not increase the scope *right now* to include automating the entire
bug workflow.

On Thu, Aug 17, 2017 at 4:53 PM, Mohammed Rafi K C 
wrote:

> Some time back (2 yrs :) ) , we had discussion on automated bug work flow
> enhancement [1] . Nandaja and Manikandan were working on this. I'm not sure
> about the current status. If you have some time take a look at the
> proposals [2] and how we can proceed.
>
>
> [1] : http://lists.gluster.org/pipermail/gluster-devel/2015-
> July/046084.html
>
> [2] : http://lists.gluster.org/pipermail/gluster-devel/2015-
> May/045374.html
>
>
> Regards
>
> Rafi
>
> On 08/17/2017 10:07 AM, Deepshikha Khandelwal wrote:
>
> Hello everyone,
>
> We are planning to move from posting to bugzilla via a gerrit hook to a
> jenkins job on build.gluster.org so that it can be edited easily and the
> output is properly visible.
>
> This bugzilla-post[1] job will be triggered by gerrit events and will
> update the status of the bug on bugzilla. This job is currently running in
> dry run mode, so it will not actually post anything to bugzilla. We will be
> verifying the job over the course of the next few days. If you notice
> anything odd in what it’s doing, please let us know.
>
> [1] 
> https://build.gluster.org/job/bugzilla-post/
>
> Thanks & Regards,
> Deepshikha Khandelwal
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Jenkins and Gerrit issues today

2017-07-17 Thread Nigel Babu
We just started using the Jenkins pipeline and its associated plugins:
https://build.gluster.org/job/nightly-master/

On Mon, Jul 17, 2017 at 6:08 PM, Michael Scherer <msche...@redhat.com>
wrote:

> Le vendredi 14 juillet 2017 à 13:00 +0530, Nigel Babu a écrit :
> > Hello,
> >
> > ## Highlights
> > * If you pushed a patch today or did "recheck centos", please do a
> recheck.
> > Those jobs were not triggered.
> > * Please actually verify that the jobs for your patches have started. You
> > can do that by visiting https://build.gluster.org/job/smoke/ (for
> smoke) or
> > https://build.gluster.org/job/centos6-regression/ (for regression) and
> > searching for your review. Verify that the patchset is correct.
> >
> > ## The Details
> >
> > This morning I installed critical security updates for Jenkins that
> needed
> > a restart of Jenkins.
>
> So I did look at the list of issues, and none seemed to affect the
> Jenkins plugins we use, so I didn't do the upgrade. I was not sure,
> however, which ones we used, so can you give a bit more detail?
>
> > After this restart, it appears that the Gerrit plugin
> > failed to load because of an XML error in the config file. As far as I
> > know, this error has always existed, but the newer version of the plugin
> > became more strict in xml parsing. I noticed this only about an hour so
> ago
> > and I've fixed it. Please let me know if there are further problems. Due
> to
> > this any jobs that should have been triggered since about 8:30 am this
> > morning were not triggered. Please manually do a recheck for your
> patches.
> >
> > Additionally, Ravi and Nithya pointed me to a problem where Gerrit wasn't
> > responding. We've noticed this quite often because we've configured
> Gerrit
> > to not drop idle connections. This forces us to restart Gerrit when there
> > are too many long-running idle connections. I've put a timeout of 10 mins
> > for idle connections. This issue should be sorted.
> >
> > However, Jenkins does an SSH connection with Gerrit by running `ssh
> > jenk...@review.gluster.org stream-events`. I'm not sure if this Gerrit
> > config change will cause a conflict with Jenkins, but we'll see in the
> next
> > few hours. None of the documentation explicitly points to a problem.
> >
> > ___
> > Gluster-infra mailing list
> > Gluster-infra@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-infra
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Did we exhausted the disk space?

2017-07-17 Thread Nigel Babu
Please file a bug:
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS&component=project-infrastructure

On Mon, Jul 17, 2017 at 6:17 PM, Karthik Subrahmanya 
wrote:

> Hi,
>
> One of my patch[1] just failed smoke test with error:
>
> Disk Requirements:At least 21MB more space needed on the / filesystem.
>
> [1] https://review.gluster.org/#/c/17485/
>
> Regards,
> Karthik
>
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Jenkins and Gerrit issues today

2017-07-14 Thread Nigel Babu
Note: I've retriggered smoke/regression where appropriate for all patches
posted since the issue started.

On Fri, Jul 14, 2017 at 1:00 PM, Nigel Babu <nig...@redhat.com> wrote:

> Hello,
>
> ## Highlights
> * If you pushed a patch today or did "recheck centos", please do a
> recheck. Those jobs were not triggered.
> * Please actually verify that the jobs for your patches have started. You
> can do that by visiting https://build.gluster.org/job/smoke/ (for smoke)
> or https://build.gluster.org/job/centos6-regression/ (for regression) and
> searching for your review. Verify that the patchset is correct.
>
> ## The Details
>
> This morning I installed critical security updates for Jenkins that needed
> a restart of Jenkins. After this restart, it appears that the Gerrit plugin
> failed to load because of an XML error in the config file. As far as I
> know, this error has always existed, but the newer version of the plugin
> became more strict in xml parsing. I noticed this only about an hour so ago
> and I've fixed it. Please let me know if there are further problems. Due to
> this any jobs that should have been triggered since about 8:30 am this
> morning were not triggered. Please manually do a recheck for your patches.
>
> Additionally, Ravi and Nithya pointed me to a problem where Gerrit wasn't
> responding. We've noticed this quite often because we've configured Gerrit
> to not drop idle connections. This forces us to restart Gerrit when there
> are too many long-running idle connections. I've put a timeout of 10 mins
> for idle connections. This issue should be sorted.
>
> However, Jenkins does an SSH connection with Gerrit by running `ssh
> jenk...@review.gluster.org stream-events`. I'm not sure if this Gerrit
> config change will cause a conflict with Jenkins, but we'll see in the next
> few hours. None of the documentation explicitly points to a problem.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins and Gerrit issues today

2017-07-14 Thread Nigel Babu
Hello,

## Highlights
* If you pushed a patch today or did "recheck centos", please do a recheck.
Those jobs were not triggered.
* Please actually verify that the jobs for your patches have started. You
can do that by visiting https://build.gluster.org/job/smoke/ (for smoke) or
https://build.gluster.org/job/centos6-regression/ (for regression) and
searching for your review. Verify that the patchset is correct.

## The Details

This morning I installed critical security updates for Jenkins that needed
a restart of Jenkins. After this restart, it appears that the Gerrit plugin
failed to load because of an XML error in the config file. As far as I
know, this error has always existed, but the newer version of the plugin
became more strict in XML parsing. I noticed this only about an hour or so ago
and I've fixed it. Please let me know if there are further problems. Due to
this any jobs that should have been triggered since about 8:30 am this
morning were not triggered. Please manually do a recheck for your patches.

Additionally, Ravi and Nithya pointed me to a problem where Gerrit wasn't
responding. We've noticed this quite often because we've configured Gerrit
to not drop idle connections. This forces us to restart Gerrit when there
are too many long-running idle connections. I've put a timeout of 10 mins
for idle connections. This issue should be sorted.
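
For reference, the setting involved is Gerrit's sshd idle timeout. Since
gerrit.config is in git-config format, the change amounts to roughly this
(the site path is an assumption):

git config -f /opt/gerrit/review_site/etc/gerrit.config sshd.idleTimeout "10 minutes"
# Gerrit needs a restart for the new timeout to take effect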

However, Jenkins does an SSH connection with Gerrit by running `ssh
jenk...@review.gluster.org stream-events`. I'm not sure if this Gerrit
config change will cause a conflict with Jenkins, but we'll see in the next
few hours. None of the documentation explicitly points to a problem.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Missing information - netbsd regression.

2017-07-05 Thread Nigel Babu
We no longer run netbsd regression on a per patch basis. See:
http://lists.gluster.org/pipermail/gluster-devel/2017-June/053080.html

On Tue, Jul 4, 2017 at 6:33 AM, Ravishankar N 
wrote:

> Hi,
>
> https://build.gluster.org/job/netbsd7-regression/ used to show the patch
> ID and revision that triggered each build on the left-hand side 'Build
> History' column. I do not see it any more. It works fine for centos builds
> though: https://build.gluster.org/job/centos6-regression/.  Is something
> broken after the recent upgrade?
>
> Regards,
>
> Ravi
>
>
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Gerrit and Jenkins status

2017-06-26 Thread Nigel Babu
Hello folks,

We had downtimes today for both Gerrit and Jenkins related upgrades. The Gerrit
upgrade went very smoothly. We will need to figure out a date in the short term
when we'll upgrade to the next major release.

The Jenkins upgrade caused some issues and some downtime. The latest
version of Jenkins requires Java 1.8, so even though Jenkins itself was back
online within 5 minutes, I had to go around installing Java on all the nodes. For
the CentOS nodes, this was trivial, so we were back up and running for most jobs
that block patches within the downtime window. The NetBSD jobs took some more time
as I figured out the best way to fix the problem.
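
On the CentOS nodes, the Java install was essentially a one-liner along
these lines (the exact package name is from memory and may have varied per
node):

yum install -y java-1.8.0-openjdk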

As far as I can see, there are no java-1.8 packages for NetBSD 6 (we run smoke
on NetBSD 6). I've therefore upgraded our smoke jobs to run on NetBSD 7.

Please let me know if there are any problems with Gerrit or Jenkins.


--
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra


[Gluster-infra] Jenkins outage on Jun 26

2017-06-22 Thread Nigel Babu
Hello folks,

We'll also have a short Jenkins outage on 26 June 2017, for a Jenkins plugin
installation and upgrade.

Date: 26th June 2017
Time: 0230 UTC (2230 EDT / 0430 CEST / 0800 IST)
Duration: 1h

Jenkins will be in a quiet time from 1h before the outage where no new builds
will be allowed to start.

--
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] Cleaning up Jenkins

2017-06-20 Thread Nigel Babu
On Thu, Apr 20, 2017 at 10:57:53AM +0530, Nigel Babu wrote:
> Hello folks,
>
> As I was testing the Jenkins upgrade, I realized we store quite a lot of old
> builds on Jenkins that doesn't seem to be useful. I'm going to start cleaning
> them slowly in anticipation of moving Jenkins over to a CentOS 7 server in the
> not-so-distant future.
>
> * Old and disabled jobs will be deleted completely.
> * Discard regression logs older than 90 days.
> * Discard smoke and dev RPM logs older than 30 days.
> * Discard post-build RPM jobs older than 10 days.
> * Release job will be unaffected. We'll store all logs.
>
> If we want to archive the old regression logs, I might looking at storing them
> some place that's not the Jenkins machine. If you have concerns or comments,
> please let me know.

I've made the changes today. All job logs (except for release jobs and regression
jobs) will be deleted after 30 days. Regression logs will be kept for 90 days so we
can debug intermittent failures.

--
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] build.gluster.org downtime on 20th June

2017-06-19 Thread Nigel Babu
Other than Gerrit, the rest can be rebooted ruthlessly.

On 19-Jun-2017 23:16, "Michael Scherer" <msche...@redhat.com> wrote:

> Le lundi 19 juin 2017 à 17:55 +0200, Michael Scherer a écrit :
> > Le lundi 19 juin 2017 à 17:12 +0530, Nigel Babu a écrit :
> > > Hello folks,
> > >
> > > We'll be having a short downtime for build.gluster.org on 20th June
> 2017
> > > (tomorrow).
> > >
> > > Date: 20th June 2017
> > > Time: 0230 UTC (2230 EDT / 0430 CEST / 0800 IST)
> > > Duration: 1h
> > >
> > > Jenkins will be in a quiet time from 1h before the outage where no new
> builds
> > > will be allowed to start.
> > >
> > > This downtime is to complete the installation of a required plugin.
> Though
> > > there is a 1h downtime window, I expect the actual outage to last
> about 20
> > > minutes at maximum.
> >
> > Since https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt just
> > have been published and it kinda mean "rebooting the whole infra", can
> > we combine do it at the same time ? (it might mean "wait until we have
> > Centos updated package", ie likely more than 24h) ?
>
> Ok, so upon reading the advisory a bit more, I do not see much reason to
> panic (selinux would contains most exploitation vectors I can think off,
> we strongly limited access, and the exploit aren't exactly silent nor
> fast), but I would still like to combine the 2 downtime if possible.
>
> I will also ruthlessly reboot the various builders during the week,
> unless people tell me not to do it.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] build.gluster.org downtime on 20th June

2017-06-19 Thread Nigel Babu
Hello folks,

We'll be having a short downtime for build.gluster.org on 20th June 2017
(tomorrow).

Date: 20th June 2017
Time: 0230 UTC (2230 EDT / 0430 CEST / 0800 IST)
Duration: 1h

Jenkins will be in a quiet time from 1h before the outage where no new builds
will be allowed to start.

This downtime is to complete the installation of a required plugin. Though
there is a 1h downtime window, I expect the actual outage to last about 20
minutes at maximum.

--
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] Weird branches in the glusterfs repository on GitHub

2017-05-01 Thread Nigel Babu
The first one is a Gerrit configuration branch and it *should* be there.

The other two, well, I recommend asking around to see who created those
branches. There's no "sync script". We sync all of what's in Gerrit over to
Github. If it's there, it means someone created it on Gerrit.

On Sun, Apr 30, 2017 at 3:40 PM, Niels de Vos  wrote:

> There are a few branches in the glusterfs github repository that should
> not be there:
>
>   - meta/config (partial/old Gerrit configuration)
>   - v3.7.15
>   - v3.8.2
>
> The last two are *branches*, and the matching tags exist as well. I'm
> not sure since when they are in the github repository, I dont seem to be
> able to find details about any of the the push operations there.
>
> Could it be that a sync-script went wrong at one point?
>
> Niels
>
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins authentication with Github

2017-04-27 Thread Nigel Babu
Hello folks,

In testing the Jenkins upgrade, I learned that we allow Jenkins read access to
/etc/shadow to allow Unix authentication. In addition to this, our
authentication was open to brute-force attacks, and it was hard to keep the
user list updated. To ease these pains, we've switched Jenkins authentication to
Github.

If you had a shell account on build.gluster.org to authenticate with Jenkins,
this account will soon be deactivated. Please get in touch if you continue to
need access.

--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Jenkins Upgrade

2017-04-27 Thread Nigel Babu
Hello,

The upgrade is now complete and we should be good to go. Please let me know if
there are any problems.

--
nigelb


On Thu, Apr 27, 2017 at 11:34:05AM +0530, Nigel Babu wrote:
> Hello folks,
>
> The first part of the Jenkins upgrade has now begun. The Jenkins server is now
> on quiet mode. No new builds will be scheduled. I will be shuting down Jenkins
> in the next 1h to begin the upgrade.
>
> --
> nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Jenkins Upgrade

2017-04-27 Thread Nigel Babu
Hello folks,

The first part of the Jenkins upgrade has now begun. The Jenkins server is now
in quiet mode. No new builds will be scheduled. I will be shutting down Jenkins
in the next 1h to begin the upgrade.

--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Planned Jenkins Outage on 27th Apr (Thu)

2017-04-26 Thread Nigel Babu
Hello

A reminder that this is today.

On Tue, Apr 18, 2017 at 04:06:25PM +0530, Nigel Babu wrote:
> Hello folks,
>
> We're announcing a Jenkins outage window on 27th Apr 2017 during the following
> times:
>
> 0300 - 0700 EDT
> 0700 - 1100 UTC
> 0800 - 1200 CEST
> 1230 - 1630 IST
>
> This is purposely in the middle of the working day so we can identify any
> issues that might come up post-upgrade and fix them immediately.
>
> We will be starting with a 1h quiet period where no jobs will be started on
> Jenkins. At this time we'll do a back up of our existing instance so we can
> restore in case we run into problems with the upgrade. After the 1h quiet
> period, we'll cancel any running jobs. Please make sure you restart these jobs
> post-upgrade. We're planning to upgrade to the latest LTS (2.46.1 currently).
> We will continue to upgade Jenkins every 12 weeks so we keep tracking the LTS
> version for continued security and stability.
>
> --
> nigelb



--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Cleaning up Jenkins

2017-04-19 Thread Nigel Babu
Hello folks,

As I was testing the Jenkins upgrade, I realized we store quite a lot of old
builds on Jenkins that doesn't seem to be useful. I'm going to start cleaning
them slowly in anticipation of moving Jenkins over to a CentOS 7 server in the
not-so-distant future.

* Old and disabled jobs will be deleted completely.
* Discard regression logs older than 90 days.
* Discard smoke and dev RPM logs older than 30 days.
* Discard post-build RPM jobs older than 10 days.
* Release job will be unaffected. We'll store all logs.

If we want to archive the old regression logs, I might look at storing them
some place that's not the Jenkins machine. If you have concerns or comments,
please let me know.

--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Planned Jenkins Outage on 27th Apr (Thu)

2017-04-18 Thread Nigel Babu
Hello folks,

We're announcing a Jenkins outage window on 27th Apr 2017 during the following
times:

0300 - 0700 EDT
0700 - 1100 UTC
0800 - 1200 CEST
1230 - 1630 IST

This is purposely in the middle of the working day so we can identify any
issues that might come up post-upgrade and fix them immediately.

We will be starting with a 1h quiet period where no jobs will be started on
Jenkins. At this time we'll do a back up of our existing instance so we can
restore in case we run into problems with the upgrade. After the 1h quiet
period, we'll cancel any running jobs. Please make sure you restart these jobs
post-upgrade. We're planning to upgrade to the latest LTS (2.46.1 currently).
We will continue to upgrade Jenkins every 12 weeks so we keep tracking the LTS
version for continued security and stability.

--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Is anyone else having trouble authenticating with review.gluster.org over ssh?

2017-04-16 Thread Nigel Babu
This should be fixed now:
https://bugzilla.redhat.com/show_bug.cgi?id=1442672

Vijay, can you link me to your failed Jenkins job? Jenkins should have been
able to clone since it uses the git protocol and not SSH.

On Sun, Apr 16, 2017 at 9:25 PM, Vijay Bellur  wrote:

>
>
> On Sun, Apr 16, 2017 at 11:44 AM, Raghavendra Talur 
> wrote:
>
>> On Sun, Apr 16, 2017 at 9:07 PM, Raghavendra Talur 
>> wrote:
>> > I am not able to login even after specifying the key file
>> >
>> > $ ssh -T -vvv -i ~/.ssh/gluster raghavendra-ta...@git.gluster.org
>> > OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
>> > debug1: Reading configuration data /home/rtalur/.ssh/config
>> > debug1: Reading configuration data /etc/ssh/ssh_config
>> > debug3: /etc/ssh/ssh_config line 56: Including file
>> > /etc/ssh/ssh_config.d/05-redhat.conf depth 0
>> > debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
>> > debug1: /etc/ssh/ssh_config.d/05-redhat.conf line 2: include
>> > /etc/crypto-policies/back-ends/openssh.txt matched no files
>> > debug1: /etc/ssh/ssh_config.d/05-redhat.conf line 8: Applying options
>> for *
>> > debug1: auto-mux: Trying existing master
>> > debug1: Control socket
>> > "/tmp/ssh_mux_git.gluster.org_22_raghavendra-talur" does not exist
>> > debug2: resolving "git.gluster.org" port 22
>> > debug2: ssh_connect_direct: needpriv 0
>> > debug1: Connecting to git.gluster.org [8.43.85.171] port 22.
>> > debug1: Connection established.
>> > debug1: identity file /home/rtalur/.ssh/gluster type 1
>> > debug1: key_load_public: No such file or directory
>> > debug1: identity file /home/rtalur/.ssh/gluster-cert type -1
>> > debug1: Enabling compatibility mode for protocol 2.0
>> > debug1: Local version string SSH-2.0-OpenSSH_7.4
>> > ssh_exchange_identification: Connection closed by remote host
>>
>> Confirmed with Pranith that he is facing same issue.
>>
>
>
> One of my jenkins jobs also bailed out since it was unable to clone from
> r.g.o.
>
> Thanks,
> Vijay
>
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] compare-bug-version-and-git-branch now runs on bugziller.rht.gluster.org

2017-04-05 Thread Nigel Babu
Hello folks,

I've been slowly working on moving jobs off our master machine on
build.gluster.org. This will let us upgrade to a new Jenkins server without too
much pain. In this regard, I've moved compare-bug-version-and-git-branch off
master onto bugziller.rht.gluster.org.

This machine has bugzilla credentials for our bot enabled. In the near future,
our post-commit hook on gerrit will move to a Jenkins job as well so that it's
easier to submit patches to the hook and to make changes without affecting
production Gerrit at the same time.

The next job scheduled to be moved off master is the release job. Once
that's done, we'll start work on upgrading Jenkins to the latest LTS in the 2.0
series.

We're also contemplating switching Jenkins authentication to Github since we
use it for Gerrit anyway and it will let us not handle that piece of
authentication ourselves. More on that after the upgrade.

--
nigelb


signature.asc
Description: PGP signature
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra
