Hi folks! Since we have the fine folks from IBM and pschindl involved
with openQA now, and I'm trying to improve our bus factor, I thought
I'd pass along a few notes about one of the scariest bits of our openQA
deployment...

The tests which require multiple VMs to communicate with each other use
software-defined networking via openvswitch to achieve this. The
upstream docs for this are:

https://github.com/os-autoinst/openQA/blob/master/docs/Networking.asciidoc

Broadly speaking, our deployments just implement that stuff via Ansible,
but it's pretty complex, so here are some notes.

The main bit of the ansible config is in
roles/openqa/worker/tasks/tap-setup.yml , with associated files and
templates in roles/openqa/worker/{files,templates} . But especially for
this networking stuff, there are actually important bits elsewhere that
are harder to find.

We need special iptables rules to make the magic happen, and they are
defined somewhere else. First off, there's a mechanism in the infra
ansible plays that lets you specify custom rules as an ansible
variable, and we do this in inventory/group_vars/openqa-tap-workers -
look at that file and you'll see some custom rules. However, this
mechanism isn't actually flexible enough to let us add all the rules we
need, so we actually have a variant of the basic iptables template
file: roles/base/templates/iptables/iptables.openqa-tap-workers . You
can diff this against roles/base/templates/iptables/iptables to see
what we change: basically we add some masquerade rules at the end.
These can't be added as custom rules because they have to go in the nat
table - if we changed the nat table in custom rules, then the
"otherwise kick everything out" bit of the template wouldn't work
correctly (as it wouldn't be applied to the right table).
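
To make the "which table does a rule live in" point a bit more concrete,
here's a rough Python sketch that groups the rules in an
iptables-save-style file by table. The sample rules are made up for
illustration (they are NOT copied from our template), and the function
name is just something I picked:

```python
from collections import defaultdict


def rules_by_table(ruleset: str) -> dict[str, list[str]]:
    """Group '-A ...' rule lines by the '*table' header they appear under."""
    tables: dict[str, list[str]] = defaultdict(list)
    current = None
    for raw in ruleset.splitlines():
        line = raw.strip()
        if line.startswith("*"):
            # '*filter', '*nat', ... opens a new table section
            current = line[1:]
        elif line == "COMMIT":
            # COMMIT closes the current table section
            current = None
        elif line.startswith("-A") and current is not None:
            tables[current].append(line)
    return dict(tables)


# Illustrative rules only, not the contents of the real template.
SAMPLE = """\
*filter
:INPUT ACCEPT [0:0]
-A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
*nat
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
"""

if __name__ == "__main__":
    for table, rules in rules_by_table(SAMPLE).items():
        print(table, len(rules))
```

Running something like this over the rendered rules file on a worker
should show the masquerade rules sitting under nat, separate from the
filter rules that the custom-rules variable feeds into.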

Why did this come up, you ask? Well, we just had to re-deploy qa09 -
which is the x86_64 tap worker host - and it wasn't quite working right
for these tap tests, with a strange failure mode (*some* network
connections from the worker VMs would work fine, others would just
stall). Also, tap networking wasn't working at all on the new ppc64
worker host: no connections from within the worker VMs were working.

After a lot of time bashing my head against the wall, I finally figured
out what was going on: it's all about network interfaces on the host.
There are a couple of bits in the iptables stuff which specifically
refer to what should be the active interface on the host. Up till
today, these all specified eth0.
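
If you want a quick way to see which interfaces a rules file refers to,
something like the following Python sketch might help. It assumes the
rules are in iptables-save format and only looks at the -i / -o flags
(and their long forms); the function name and the example rules are
hypothetical, not taken from our plays:

```python
import re

# Matches '-i eth0', '-o eth0' and the long-form equivalents.
_IFACE_FLAG = re.compile(
    r"(?:^|\s)(?:-i|-o|--in-interface|--out-interface)\s+(\S+)"
)


def interfaces_in_rules(ruleset: str) -> set[str]:
    """Return every interface name mentioned in the '-A' rule lines."""
    found: set[str] = set()
    for line in ruleset.splitlines():
        line = line.strip()
        if line.startswith("-A"):
            found.update(_IFACE_FLAG.findall(line))
    return found


if __name__ == "__main__":
    # Illustrative rules only.
    example = (
        "-A INPUT -i eth0 -j ACCEPT\n"
        "-A POSTROUTING -o eth0 -j MASQUERADE\n"
    )
    print(sorted(interfaces_in_rules(example)))
```

If the interface that is actually carrying traffic on the host isn't in
that set, that's a strong hint the firewall is what's getting in the way.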

When qa09 got re-deployed, for some reason, it had active network
connections on *both* eth0 *and* eth1. This was causing the weird
behaviour - it looks like openvswitch was deciding more or less at
random to route traffic from the guests over eth0 or eth1, and any
traffic that got routed over eth0 worked, but traffic routed over eth1
just didn't because the firewall wasn't set up to allow it.

There's actually even a bit in tap-setup.yml that tries to disable
eth1, but it didn't work: it adds ONBOOT=no to ifcfg-eth1 only if that
file exists, and on qa09 it *didn't* exist. It seems NM was just
bringing eth1 up without any config file at all. So for now I
manually created an ifcfg-eth1 on qa09 which specifies ONBOOT=no .
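
For what it's worth, the behaviour we'd probably want is "make sure the
file says ONBOOT=no, creating it if it's missing". Here's a minimal
Python sketch of that logic. This is NOT what tap-setup.yml does; the
function name is made up, and the DEVICE= line written for a new file is
my assumption about a sensible minimal ifcfg file:

```python
from pathlib import Path


def ensure_onboot_no(cfg_dir: str, iface: str) -> Path:
    """Make sure ifcfg-<iface> in cfg_dir exists and contains ONBOOT=no."""
    path = Path(cfg_dir) / f"ifcfg-{iface}"
    if path.exists():
        # Keep existing settings, dropping any previous ONBOOT= line.
        lines = [
            line
            for line in path.read_text().splitlines()
            if not line.strip().startswith("ONBOOT=")
        ]
    else:
        # Assumption: a bare DEVICE= line is enough for a new file.
        lines = [f"DEVICE={iface}"]
    lines.append("ONBOOT=no")
    path.write_text("\n".join(lines) + "\n")
    return path
```

On a real host you'd point cfg_dir at /etc/sysconfig/network-scripts ,
but it's probably wise to try it against a scratch directory first.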

On the ppc64 worker host, the active network interface isn't eth0, it's
eth2. So to account for this, I changed the custom iptables stuff to
allow everything for both eth0 *and* eth2 (this involved changing both
the custom rules and the modified template).

With these changes, the 'tap' networking tests seem to be working
properly on both qa09 and openqa-ppc64le-01.

So the moral of the story is, if I'm off on a desert island and this
stuff starts giving you trouble for some reason, remember about all the
different places where we have config for it in the ansible plays, and
check the active interfaces on the problematic host...
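
As a starting point for that last check, here's a small Python sketch
that lists the interfaces the kernel reports as up, by reading
operstate under /sys/class/net . The function name is my own invention,
and it deliberately skips lo:

```python
from pathlib import Path


def active_interfaces(sysfs_root: str = "/sys/class/net") -> list[str]:
    """Return the names of non-loopback interfaces whose operstate is 'up'."""
    active = []
    for entry in sorted(Path(sysfs_root).iterdir()):
        state_file = entry / "operstate"
        # Skip loopback and anything that isn't an interface directory.
        if entry.name == "lo" or not state_file.is_file():
            continue
        if state_file.read_text().strip() == "up":
            active.append(entry.name)
    return active


if __name__ == "__main__":
    print(active_interfaces())
```

On a tap worker the output may also include the openvswitch bridge and
tap devices, so what you're looking for is more than one physical
interface (like eth0 *and* eth1), or one the iptables rules don't
mention.
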
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
_______________________________________________
qa-devel mailing list -- qa-devel@lists.fedoraproject.org