On my journey toward a silent, Atom-based, three-node hyperconverged setup I hit
a snag: evidently these CPUs are too slow for Ansible.

The Gluster storage part went great on fresh oVirt Node images that I had
configured to leave an empty partition instead of the standard /dev/sdb. The
HostedEngine setup part, however, would then fail without any log-visible error
while the transient HostedEngineLocal VM was supposed to be launched; the wizard
would just show "deployment failed" and go ahead and delete the VM.

I then moved the SSD to a machine with a more powerful Xeon D-1541 CPU and,
after some fiddling with the network (I miss good old eth0!), the deployment
failed there as well, but this time it also failed to delete the temporary VM
image, because that VM actually turned out to be running: I could even connect
to its console and comb the logs for clues as to what might have gone wrong
(nothing visible). Evidently Ansible was running out of patience just a tiny bit
too early.

I then kicked it into high gear with an i7-7700K, again using the same SSD and a
working, fully synced three-node Gluster. The deployment still took what felt
like an hour to creep through every step, but it got done: primary node on the
i7, secondary nodes on the Atoms, with full migration capabilities etc.

I then had to do some fiddling, because the HostedEngine had configured the
cluster CPU type to Skylake-Spectre, but after changing that I could migrate it
to an Atom node and was ready to move the primary to the intended Atom hardware
target.
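
Changing the cluster CPU type could presumably also be scripted with the Python
SDK (ovirtsdk4) instead of clicking through the UI. A minimal, untested sketch;
the engine FQDN, credentials, cluster name and the CPU type string are just
placeholders that would need adapting to whatever the Atoms actually support:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details (hypothetical engine FQDN and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

clusters_service = connection.system_service().clusters_service()
cluster = clusters_service.list(search='name=Default')[0]   # assumes the default cluster
cluster_service = clusters_service.cluster_service(cluster.id)

# Lower the cluster CPU type to something the Atoms can actually offer;
# the exact type string has to match one the engine knows about.
cluster_service.update(
    types.Cluster(
        cpu=types.Cpu(type='Intel Westmere Family'),
    ),
)

connection.close()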

But at that point the overlay network had already been configured, and it is
evidently tied to the device name of the 10Gbit NIC in the i7 workstation; I
haven't been able to make it work on the Atom. Gluster runs fine, but the host
is reported as "non-operational" and re-installation fails, because the
ovirtmgmt network isn't properly configured.
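
What I would hope to do is re-point the ovirtmgmt attachment at the NIC name the
Atom actually has. With the Python SDK that might look roughly like the
following untested sketch (engine FQDN, credentials, host and NIC names are
placeholders, and I don't know whether the engine accepts this while the host is
non-operational):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details (hypothetical engine FQDN and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=atom1')[0]   # hypothetical host name
host_service = hosts_service.host_service(host.id)

# Attach ovirtmgmt to the NIC the Atom actually has ('enp2s0' is a guess
# at the predictable device name) and let it take its address via DHCP.
host_service.setup_networks(
    modified_network_attachments=[
        types.NetworkAttachment(
            network=types.Network(name='ovirtmgmt'),
            host_nic=types.HostNic(name='enp2s0'),
            ip_address_assignments=[
                types.IpAddressAssignment(
                    assignment_method=types.BootProtocol.DHCP,
                ),
            ],
        ),
    ],
    check_connectivity=True,
)

# Persist the network configuration on the host so it survives a reboot.
host_service.commit_net_config()
connection.close()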

That specific issue may seem well outside what oVirt should have to support, yet
an HA embedded/edge platform may very well see nodes having to be replaced or
renewed with as little interruption or downtime as possible, which is why I am
asking the larger question:

How can you a) replace a failed ("burned") node or b) upgrade nodes, while
maintaining fault tolerance?

The distinction in b) would be that it's a planned maneuver during normal 
operations without downtime.

I'd want to do it pretty much the way I have been playing with compute nodes:
creating new ones, pushing VMs onto them, pushing the VMs back out to other
hosts, removing and replacing the nodes seamlessly... except that the
hyperconverged Gluster nodes are special and, from what I see, much harder to
replace than a pure Gluster storage brick.
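
On the Gluster side there is of course the volume replace-brick mechanism, but
the compute-node half of a planned replacement is the part I'd expect to be able
to script along these lines. An untested sketch with the Python SDK; names,
credentials and the cluster are placeholders, and the hosted-engine metadata and
brick migration are deliberately NOT handled here:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details (hypothetical engine FQDN and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

hosts_service = connection.system_service().hosts_service()

# 1. Put the old node into maintenance, which live-migrates its VMs away.
old = hosts_service.list(search='name=node2')[0]   # hypothetical host name
old_service = hosts_service.host_service(old.id)
old_service.deactivate()
while old_service.get().status != types.HostStatus.MAINTENANCE:
    time.sleep(10)

# 2. Remove it from the cluster.
old_service.remove()

# 3. Add the replacement node to the same cluster.
hosts_service.add(
    types.Host(
        name='node2-new',                      # hypothetical replacement node
        address='node2-new.example.com',
        root_password='secret',
        cluster=types.Cluster(name='Default'),
    ),
)

connection.close()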

I welcome any help with
- fixing the network config in my limping 1:3 Atom cluster
- eliminating the need to fiddle with an i7 because of Ansible timing
- ensuring the long-term operability of a software-defined datacenter with
  changing hardware