Hello folks,

As you may have noticed, we've had a lot of centos6-regression failures
lately. The geo-replication failures are the new ones that particularly
concern me. These failures have nothing to do with the tests themselves;
the tests are exposing a problem in our infrastructure that we've carried
around for a long time. Our machines are not clean machines that we
automated from scratch; we set up automation on machines that already
existed. At some point, we loaned machines out for debugging, and during
that time developers inadvertently ran 'make install', installing onto
system paths rather than into /build/install. Those stray installs are
what's causing the geo-replication tests to fail. I've tried cleaning the
machines up several times with little to no success.
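
For anyone who wants to check a node themselves, here's a minimal sketch
of how one might hunt for stray 'make install' artifacts. The scan
prefixes and file patterns below are assumptions for illustration, not
our actual cleanup tooling; /build/install is the prefix the regression
jobs actually build into.

#!/usr/bin/env python3
"""Sketch: find Gluster bits that a stray 'make install' left on
system paths. Prefixes and patterns are illustrative assumptions,
not our real cleanup scripts."""
import pathlib

# Prefixes a default 'make install' would typically write into (assumed).
SYSTEM_PREFIXES = ("/usr/local/sbin", "/usr/local/lib", "/usr/local/libexec")
PATTERNS = ("gluster*", "*glusterfs*")

def find_stray_installs():
    strays = set()
    for prefix in SYSTEM_PREFIXES:
        root = pathlib.Path(prefix)
        if not root.is_dir():
            continue
        for pattern in PATTERNS:
            # rglob also catches nested dirs like lib/glusterfs/<ver>/
            strays.update(root.rglob(pattern))
    return sorted(strays)

if __name__ == "__main__":
    for path in find_stray_installs():
        print(path)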

Last week, we decided to take an aggressive path to fix this problem: we
planned to replace all our problematic nodes with new CentOS 7 nodes. This
exposed more problems. Our automation expected a specific type of machine
from Rackspace, and those are no longer offered, so some steps failed.
I've spent this weekend tweaking our automation so that it works on the
new Rackspace machines, and I'm down to just one test failure[1]. I have a
patch up to fix that failure[2]. As soon as it's merged, we can push
forward with CentOS 7 nodes. Since we're dropping support for CentOS 6 in
4.0, it makes sense to make this move sooner rather than later.

We will no longer lend out machines from production. Instead, when someone
needs a machine for debugging, we'll create a new node from a snapshot of
an existing production node and destroy it after use. This prevents this
particular problem from recurring, and it also means our production
capacity stays at 100% at all times with minimal wastage.
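
To make the new workflow concrete, here's a rough sketch against the
OpenStack SDK (Rackspace's cloud speaks that API). The cloud entry,
snapshot image name, and flavor are all placeholders, not our actual
automation:

import openstack

def lend_machine(conn, requester):
    # Boot a throwaway loaner from a snapshot of a production builder.
    # Image and flavor names are placeholders for illustration.
    return conn.create_server(
        name="loaner-" + requester,
        image="prod-builder-snapshot",
        flavor="general1-4",
        wait=True,
    )

def reclaim_machine(conn, server):
    # Destroy the loaner outright; nothing done on it can leak back
    # into the production pool.
    conn.delete_server(server.id, wait=True)

conn = openstack.connect(cloud="rackspace")  # named cloud entry (placeholder)
server = lend_machine(conn, "debugging-session")
# ... developer debugs, runs 'make install' as much as they like ...
reclaim_machine(conn, server)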

[1]: https://build.gluster.org/job/cage-test/184/consoleText
[2]: https://review.gluster.org/#/c/19262/

-- 
nigelb