Both tests are now marked as bad, since there has been more than one
instance where they have failed even after the infra problem was fixed.
I request the geo-rep team to take a look and revive the tests soon.

On Tue, Jan 23, 2018 at 2:30 PM, Atin Mukherjee <amukh...@redhat.com> wrote:

>
>
> On Mon, Jan 22, 2018 at 5:13 PM, Nigel Babu <nig...@redhat.com> wrote:
>
>> Update: All the nodes that had problems with geo-rep are now fixed. We're
>> waiting on the patch to be merged before we switch over to CentOS 7. If
>> things go well, we'll replace the nodes one by one as soon as we have one
>> green run on CentOS 7.
>>
>
> I just noticed we failed again on the geo-rep tests @
> https://build.gluster.org/job/centos6-regression/8604/console . Nigel
> reconfirmed that we have all the machines cleaned up. What else could be
> going wrong here?
>
>
>> On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nig...@redhat.com> wrote:
>>
>>> Hello folks,
>>>
>>> As you may have noticed, we've had a lot of centos6-regression failures
>>> lately. The geo-replication failures are the new ones that particularly
>>> concern me. These failures have nothing to do with the tests themselves; the
>>> tests are exposing a problem in our infrastructure that we've carried around
>>> for a long time. Our machines are not clean machines that we provisioned
>>> through automation; we set up automation on machines that had already been
>>> created. At some point, we loaned machines out for debugging. During this
>>> time, developers have inadvertently run 'make install' on these systems,
>>> installing into system paths rather than into /build/install. This is what
>>> is causing the geo-replication tests to fail. I've tried cleaning the
>>> machines up several times with little to no success.
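>>>
>>> As a very rough illustration of what we look for when cleaning up, here is
>>> a minimal Python sketch that flags gluster files sitting in common system
>>> install locations instead of /build/install. The prefixes it scans and the
>>> find_stray_gluster_files() helper are illustrative assumptions on my part,
>>> not the actual cleanup procedure we run:
>>>
>>>   #!/usr/bin/env python
>>>   # Minimal sketch: flag gluster binaries/libraries that ended up in
>>>   # system paths (e.g. from a stray 'make install') instead of the
>>>   # regression prefix /build/install. The prefixes below are assumed,
>>>   # not an exhaustive or authoritative list.
>>>   import os
>>>
>>>   SYSTEM_PREFIXES = ["/usr/local/sbin", "/usr/local/bin",
>>>                      "/usr/local/lib", "/usr/local/libexec"]
>>>
>>>   def find_stray_gluster_files():
>>>       stray = []
>>>       for prefix in SYSTEM_PREFIXES:
>>>           if not os.path.isdir(prefix):
>>>               continue
>>>           for root, _dirs, files in os.walk(prefix):
>>>               for name in files:
>>>                   if "gluster" in name.lower():
>>>                       stray.append(os.path.join(root, name))
>>>       return stray
>>>
>>>   if __name__ == "__main__":
>>>       strays = find_stray_gluster_files()
>>>       if strays:
>>>           print("Possible stray gluster installs:")
>>>           for path in strays:
>>>               print("  " + path)
>>>       else:
>>>           print("No stray gluster files found in the checked prefixes.")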
>>>
>>> Last week, we decided to take an aggressive path to fix this problem: we
>>> planned to replace all our problematic nodes with new CentOS 7 nodes. This
>>> exposed more problems. Our automation expected a specific type of machine
>>> from Rackspace that is no longer offered, so it failed on some steps. I've
>>> spent this weekend tweaking the automation so that it works on the new
>>> Rackspace machines, and I'm down to just one test failure[1]. I have a patch
>>> up to fix this failure[2]. As soon as that patch is merged, we can push
>>> forward with CentOS 7 nodes. In 4.0, we're dropping support for CentOS 6, so
>>> it makes more sense to do this sooner rather than later.
>>>
>>> We will no longer lend out machines from production. Instead, when a loaner
>>> is needed, we'll create a new node from a snapshot of an existing production
>>> node and destroy it after use. This helps prevent this particular problem in
>>> the future. It also means that our machine capacity is at 100% at all times,
>>> with very minimal wastage.
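>>>
>>> As a rough sketch of the intended loaner workflow (CloudProvider and its
>>> methods below are made-up placeholders, not the actual Rackspace tooling):
>>>
>>>   # Hypothetical sketch of the loaner-machine lifecycle: create a node
>>>   # from a snapshot of a production node, hand it out for debugging,
>>>   # and destroy it afterwards. CloudProvider is a made-up placeholder.
>>>   class CloudProvider(object):
>>>       def create_from_snapshot(self, snapshot_id, name):
>>>           # provision a new node from the snapshot (placeholder)
>>>           raise NotImplementedError
>>>
>>>       def destroy(self, node):
>>>           # delete the node and release its resources (placeholder)
>>>           raise NotImplementedError
>>>
>>>   def loan_machine(provider, snapshot_id, debug_session):
>>>       node = provider.create_from_snapshot(snapshot_id, "loaner-node")
>>>       try:
>>>           debug_session(node)     # developer does their debugging here
>>>       finally:
>>>           provider.destroy(node)  # the loaner is always destroyed after use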
>>>
>>> [1]: https://build.gluster.org/job/cage-test/184/consoleText
>>> [2]: https://review.gluster.org/#/c/19262/
>>>
>>> --
>>> nigelb
>>>
>>
>>
>>
>> --
>> nigelb
>>
>
>