
After a few days of trial, error, and madness - I *think* I found the
source of my problem. Or at least I can now replicate it reliably. These
are the basics of my speed-run-to-test-failures setup.

Fresh minimal install of Scientific Linux 7.4 on a physical host for my
engine. Add the oVirt 4.2 repo and run engine-setup - just blast through
the defaults. Configure it with the default DC and cluster.

Fresh minimal install of Scientific Linux 7.4 on node1 - configure only
the primary network card. Add the ovirt repo.

Add the host to the cluster. It provisions just fine. Life is good.

Now here is where things split.

Scenario 1: Build node2 same as node1, configuring only the primary
network card, and add it as a host. Provisions just fine. Life is good.

Scenario 2: Configure a second network. In my case a BMC/IPMI network.
It doesn't matter whether it is required or not - both cause failures,
though the errors are slightly more evident with required. Make sure
the network is assigned to node1, has an IP, and is configured in the
up state. Now build node2 same as before with only the primary network
configured and add it as a host.

Failure followed by infinite loop of setting it into Non-Operational!

The pop-up gives you some crap about "Host has no default route." but
that is 100% a red herring.
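
For anyone who wants to rule the default route out on the host itself,
here is a minimal sketch. It assumes a Linux host and the standard
/proc/net/route layout; the function name has_default_route is made up
for illustration and is not part of oVirt or VDSM. It only inspects the
kernel routing table, so it says nothing about how the engine arrives at
its own message.

```python
RTF_UP = 0x1  # kernel route flag: route is usable


def has_default_route(route_table_text):
    """Return True if text in /proc/net/route format contains a default route.

    A default route is a row whose Destination and Mask columns are both
    00000000 and whose Flags column has the RTF_UP bit set.
    """
    lines = route_table_text.strip().splitlines()
    for line in lines[1:]:  # first row is the column header
        fields = line.split()
        if len(fields) < 8:
            continue  # malformed or truncated row
        destination, flags, mask = fields[1], int(fields[3], 16), fields[7]
        if destination == "00000000" and mask == "00000000" and flags & RTF_UP:
            return True
    return False
```

On the node this could be called as
has_default_route(open("/proc/net/route").read()).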

Dig a little deeper and you get a message like this:
"node2 does not comply with the cluster Default networks, the following
networks are missing on host: 'ovirtmgmt'"

Ah. That's a bit more relevant, but why can't it configure it? Or at
least get to the point where it asks me "Hey, networking is a bit off -
do you want to configure that now?" That would be nice...

Fortunately the troubleshooting guide has something about that!

Unfortunately, it doesn't do anything to help. Even after doing these
steps, the loop just keeps going...nothing changes.

Scratch it all and completely rebuild AGAIN for...
Scenario 3: Configure a second network (BMC) and assign it to node1 just
like before. Build out node2 same as node1, but this time add in the
ifcfg-* files (with the IP address updated for the correct host,
obviously). Now add it as a host.

Doh! Same error. :-/

OK fine. Let's really get into it. First off, the networking page for
the host is blank. It never pulls back the network cards so you can't
actually make changes via the web page. Nor can you assign networks. So
the web interface doesn't help at all.

Let's look at the engine log instead.

2018-04-17 14:33:00,336-05 INFO
(EE-ManagedThreadFactory-engine-Thread-1091) []
ResourceManager::vdsNotResponding entered for Host
'f0a3d515-8ba2-490e-8d65-54edbb52cefc', ''
2018-04-17 14:33:00,360-05 INFO
(EE-ManagedThreadFactory-engine-Thread-1091) [5291eee5] Lock Acquired to
2018-04-17 14:33:00,388-05 ERROR
(EE-ManagedThreadFactory-engineScheduled-Thread-44) [2b853e43] Host
'node2' is set to Non-Operational, it is missing the following networks:
2018-04-17 14:33:00,403-05 WARN
(EE-ManagedThreadFactory-engineScheduled-Thread-44) [2b853e43] EVENT_ID:
VDS_SET_NONOPERATIONAL_NETWORK(519), Host node2 does not comply with the
cluster Default networks, the following networks are missing on host:
2018-04-17 14:33:00,407-05 INFO
(EE-ManagedThreadFactory-engine-Thread-1091) [5291eee5] Running command:
VdsNotRespondingTreatmentCommand internal: true. Entities affected :
ID: f0a3d515-8ba2-490e-8d65-54edbb52cefc Type: VDS

There's the message from before. Good. On the right track. Not sure why
it thinks the host is unreachable, because the host is just fine.
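
Picking the ERROR and WARN entries out of a log excerpt like the one
above can be scripted. This is a rough sketch, assuming every entry
starts with a timestamp in the form shown (for example
"2018-04-17 14:33:00,336-05") followed by the level, and that any other
line is a continuation of the previous entry. The names parse_entries
and problems are hypothetical helpers, not oVirt tooling.

```python
import re

# An entry starts with "YYYY-MM-DD HH:MM:SS,mmm+ZZ" followed by the log level.
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+(\w+)"
)


def parse_entries(text):
    """Group wrapped log lines into one dict per entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            entries.append({
                "time": match.group(1),
                "level": match.group(2),
                "message": line[match.end():].strip(),
            })
        elif entries:
            # Continuation line: append it to the most recent entry.
            entries[-1]["message"] = (entries[-1]["message"] + " " + line).strip()
    return entries


def problems(entries):
    """Keep only the entries logged at ERROR or WARN level."""
    return [e for e in entries if e["level"] in ("ERROR", "WARN")]
```

Feeding the excerpt above through problems(parse_entries(text)) should
leave just the Non-Operational ERROR and the
VDS_SET_NONOPERATIONAL_NETWORK warning.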

2018-04-17 14:33:01,978-05 ERROR
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Command
'GetAllVmStatsVDSCommand(HostName = node2,
execution failed: No route to host

Huh. Again with the no route to host. But THERE IS! The network is
functioning perfectly. IPs all work. DNS all works. Routing is fine. I
have no idea what it is complaining about.

2018-04-17 14:33:03,873-05 INFO
(EE-ManagedThreadFactory-engineScheduled-Thread-39) [4f72afaa] START,
SetVdsStatusVDSCommand(HostName = node2,
status='NonOperational', nonOperationalReason='NETWORK_UNREACHABLE',
stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 7459a748

Which network is unreachable? Because every single one of them is fine! Ugh!
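
The status and reason fields can be pulled out of a command line like
the one above with a small regex. A sketch only, assuming the fields
always appear as name='value' in single quotes, as they do in this
excerpt; quoted_fields is a made-up helper name.

```python
import re

# Matches name='value' pairs such as status='NonOperational'.
FIELD = re.compile(r"(\w+)='([^']*)'")


def quoted_fields(message):
    """Return a dict of every name='value' pair found in a log message."""
    return dict(FIELD.findall(message))
```

Looking up "nonOperationalReason" in the result for the
SetVdsStatusVDSCommand entry gives the reason string the engine logged,
which makes it easy to tally reasons across a long log.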

I am completely stumped as to why it works perfectly
pre-additional-networks but fails every time after a network is configured.

A couple of questions.

1. I assume people have added hosts _after_ they've configured multiple
networks. So what am I doing wrong? Why am I unable to add a host?
Again, if I don't configure that second network, it will happily add all
my hosts. But what happens when I want to add a host in the future?

2. How do I break that infuriating infinite non-operational loop? I
can't put it into maintenance mode, I can't delete the host, or anything
else. The options are greyed out. The only solution I've found is to yank
the power and after it freaks out for about 30 minutes because it can't
find the host, it will stop trying. But I still can't seem to remove the
bad host. There has to be a way via command-line to say "stop timing
out, knock that off, and delete this host!" but I'm not finding it in my

3. I feel like I go through periods with oVirt where everything is
running exactly the way I want then something happens (like me trying to
add a host! Or thinking I can just change a host IP without the whole
thing dying on me!) and it all just falls apart. I feel like I am just
stumbling through most of it. I've previously gotten a lot out of the
Red Hat classes and work has offered to send me to a training of my
choice this year. I am really considering taking the 318 Virtualization
class. I'm curious though, how close is that to what I would be working
with in oVirt? I'm guessing that since 4.2 recently came out, there is
probably minimal chance the class will cover 4.2, but maybe it is close
enough? I would love to hear feedback.

