For the record, once I added a new storage domain the Data Center came up. So in the end, this seems to have been due to known bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1160667
https://bugzilla.redhat.com/show_bug.cgi?id=1160423

Effectively, for hosts with static/manual IP addressing (i.e. not DHCP), the DNS and default route information are not set up correctly by hosted-engine-setup. I'm not sure why that's not considered a higher-priority bug (e.g. a blocker for 3.5.2?), since I believe the most typical configuration for servers is static IP addressing.

All seems to be working now. Many thanks to Simone for the invaluable assistance.

-Bob

On Mar 10, 2015 2:29 PM, "Bob Doolittle" <b...@doolittle.us.com> wrote:
>
> On 03/10/2015 10:20 AM, Simone Tiraboschi wrote:
>>
>> ----- Original Message -----
>>> From: "Bob Doolittle" <b...@doolittle.us.com>
>>> To: "Simone Tiraboschi" <stira...@redhat.com>
>>> Cc: "users-ovirt" <users@ovirt.org>
>>> Sent: Tuesday, March 10, 2015 2:40:13 PM
>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)
>>>
>>> On 03/10/2015 04:58 AM, Simone Tiraboschi wrote:
>>>>
>>>> ----- Original Message -----
>>>>> From: "Bob Doolittle" <b...@doolittle.us.com>
>>>>> To: "Simone Tiraboschi" <stira...@redhat.com>
>>>>> Cc: "users-ovirt" <users@ovirt.org>
>>>>> Sent: Monday, March 9, 2015 11:48:03 PM
>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)
>>>>>
>>>>> On 03/09/2015 02:47 PM, Bob Doolittle wrote:
>>>>>>
>>>>>> Resending with CC to list (and an update).
>>>>>>
>>>>>> On 03/09/2015 01:40 PM, Simone Tiraboschi wrote:
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Bob Doolittle" <b...@doolittle.us.com>
>>>>>>>> To: "Simone Tiraboschi" <stira...@redhat.com>
>>>>>>>> Cc: "users-ovirt" <users@ovirt.org>
>>>>>>>> Sent: Monday, March 9, 2015 6:26:30 PM
>>>>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (Cannot add the host to cluster ... SSH has failed)
>>>>>>>>
>>> ...
>>>>>>>>
>>>>>>>> OK, I've started over. Simply removing the storage domain was insufficient; the hosted-engine deploy failed when it found the HA and Broker services already configured. I decided to just start over fresh, beginning with re-installing the OS on my host.
>>>>>>>>
>>>>>>>> I can't deploy DNS at the moment, so I have to simply replicate /etc/hosts files on my host/engine. I did that this time, but have run into a new problem:
>>>>>>>>
>>>>>>>> [ INFO ] Engine replied: DB Up!Welcome to Health Status!
>>>>>>>>           Enter the name of the cluster to which you want to add the host (Default) [Default]:
>>>>>>>> [ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
>>>>>>>> [ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
>>>>>>>> [ ERROR ] Unable to add ovirt-vm to the manager
>>>>>>>>           Please shutdown the VM allowing the system to launch it as a monitored service.
>>>>>>>>           The system will wait until the VM is down.
>>>>>>>> [ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
>>>>>>>> [ INFO ] Stage: Clean up
>>>>>>>> [ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
>>>>>>>>
>>>>>>>> I've attached my engine log and the ovirt-hosted-engine-setup log. I think I had an issue with resolving external hostnames, or else a connectivity issue during the install.
>>>>>>>
>>>>>>> For some reason your engine wasn't able to deploy your host, but the SSH session this time was established.
>>>>>>> 2015-03-09 13:05:58,514 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-8-thread-3) [3cf91626] Host installation failed for host 217016bb-fdcd-4344-a0ca-4548262d10a8, ovirt-vm.: java.io.IOException: Command returned failure code 1 during SSH session 'r...@xion2.smartcity.net'
>>>>>>>
>>>>>>> Can you please attach host-deploy logs from the engine VM?
>>>>>>
>>>>>> OK, attached.
>>>>>>
>>>>>> Like I said, it looks to me like a name-resolution issue during the yum update on the engine. I think I've fixed that, but do you have a better suggestion for cleaning up and re-deploying other than installing the OS on my host and starting all over again?
>>>>>
>>>>> I just finished starting over from scratch, beginning with OS installation on my host/node, and wound up with a very similar problem - the engine couldn't reach the hosts during the yum operation. But this time the error was "Network is unreachable". Which is weird, because I can ssh into the engine and ping many of those hosts after the operation has failed.
>>>>>
>>>>> Here's my latest host-deploy log from the engine. I'd appreciate any clues.
>>>>
>>>> It seems that your host is now able to resolve those addresses, but it's not able to connect over http.
>>>> On your host, some of them resolve as IPv6 addresses; can you please try to use curl to get one of the files that it wasn't able to fetch?
>>>> Can you please check your network configuration before and after host-deploy?
>>>
>>> I can give you the network configuration after host-deploy, at least for the host/Node. The engine won't start for me this morning, after I shut down the host for the night.
>>>
>>> In order to give you the config before host-deploy (or, apparently, for the engine), I'll have to re-install the OS on the host and start again from scratch. Obviously I'd rather not do that unless absolutely necessary.
>>>
>>> Here's the host config after the failed host-deploy:
>>>
>>> Host/Node:
>>>
>>> # ip route
>>> 169.254.0.0/16 dev ovirtmgmt scope link metric 1007
>>> 172.16.0.0/16 dev ovirtmgmt proto kernel scope link src 172.16.0.58
>>
>> You are missing a default gateway, and hence the issue.
>> Are you sure that it was properly configured before trying to deploy that host?
>
> It should have been; it was a fresh OS install. So I'm starting again, and keeping careful records of my network config.
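(A minimal way to capture the kind of before/after records Simone asked for, assuming a stock Fedora 20 host with the iproute tools; the file names and paths are only illustrative, and the actual records from this setup follow below.)

# capture routing, addressing, and DNS state before the deploy
ip route > /root/net-before.route
ip addr  > /root/net-before.addr
cp /etc/resolv.conf /root/net-before.resolv
# ...run "hosted-engine --deploy", then repeat with "after" in the names...
ip route > /root/net-after.route
ip addr  > /root/net-after.addr
cp /etc/resolv.conf /root/net-after.resolv
# a missing default route or an emptied resolv.conf shows up immediately:
diff /root/net-before.route  /root/net-after.route
diff /root/net-before.resolv /root/net-after.resolv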
>
> Here is my initial network config of my host/node, immediately following a new OS install:
>
> % ip route
> default via 172.16.0.1 dev p3p1 proto static metric 1024
> 172.16.0.0/16 dev p3p1 proto kernel scope link src 172.16.0.58
>
> % ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global p3p1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
>
>
> After the VM is first created, the host/node config is:
>
> # ip route
> default via 172.16.0.1 dev ovirtmgmt
> 169.254.0.0/16 dev ovirtmgmt scope link metric 1006
> 172.16.0.0/16 dev ovirtmgmt proto kernel scope link src 172.16.0.58
>
> # ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
> 4: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
>     link/ether 92:cb:9d:97:18:36 brd ff:ff:ff:ff:ff:ff
> 5: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
>     link/ether 9a:bc:29:52:82:38 brd ff:ff:ff:ff:ff:ff
> 6: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN group default qlen 500
>     link/ether fe:16:3e:16:a4:37 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::fc16:3eff:fe16:a437/64 scope link
>        valid_lft forever preferred_lft forever
>
>
> At this point, I was already seeing a problem on the host/node. I remembered that a newer version of the sos package is delivered from the ovirt repositories.
> So I tried to do a "yum update" on my host, and got a similar problem: > > % sudo yum update > [sudo] password for rad: > Loaded plugins: langpacks, refresh-packagekit > Resolving Dependencies > --> Running transaction check > ---> Package sos.noarch 0:3.1-1.fc20 will be updated > ---> Package sos.noarch 0:3.2-0.2.fc20.ovirt will be an update > --> Finished Dependency Resolution > > Dependencies Resolved > > ================================================================================================================ > Package Arch Version > Repository Size > ================================================================================================================ > Updating: > sos noarch 3.2-0.2.fc20.ovirt > ovirt-3.5 292 k > > Transaction Summary > ================================================================================================================ > Upgrade 1 Package > > Total download size: 292 k > Is this ok [y/d/N]: y > Downloading packages: > No Presto metadata available for ovirt-3.5 > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > http://www.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: www.gtlib.gatech.edu > <http://www.gtlib.gatech.edu>" > Trying other mirror. > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > ftp://ftp.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: ftp.gtlib.gatech.edu > <http://ftp.gtlib.gatech.edu>" > Trying other mirror. > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > http://resources.ovirt.org/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: resources.ovirt.org > <http://resources.ovirt.org>" > Trying other mirror. > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > http://ftp.snt.utwente.nl/pub/software/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: ftp.snt.utwente.nl > <http://ftp.snt.utwente.nl>" > Trying other mirror. > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > http://ftp.nluug.nl/os/Linux/virtual/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: ftp.nluug.nl > <http://ftp.nluug.nl>" > Trying other mirror. > sos-3.2-0.2.fc20.ovirt.noarch. FAILED > http://mirror.linux.duke.edu/ovirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: > [Errno 14] curl#6 - "Could not resolve host: mirror.linux.duke.edu > <http://mirror.linux.duke.edu>" > Trying other mirror. > > > Error downloading packages: > sos-3.2-0.2.fc20.ovirt.noarch: [Errno 256] No more mirrors to try. > > > This was similar to my previous failures. I took a look, and the problem was > that /etc/resolv.conf had no nameservers, and the > /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt file contained no entries for > DNS1 or DOMAIN. > > So, it appears that when hosted-engine set up my bridged network, it > neglected to carry over the DNS configuration necessary to the bridge. > > Note that I am using *static* network configuration, rather than DHCP. During > installation of the OS I am setting up the network configuration as Manual. > Perhaps the hosted-engine script is not properly prepared to deal with that? 
>
> I went ahead and modified the ifcfg-ovirtmgmt network script (for the next service restart/boot) and resolv.conf (I was afraid to restart the network in the middle of hosted-engine execution, since I don't know what might already be connected to the engine). This time it got further, but ultimately it still failed at the very end:
>
> [ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
> [ INFO ] Still waiting for VDSM host to become operational...
> [ INFO ] The VDSM Host is now operational
>          Please shutdown the VM allowing the system to launch it as a monitored service.
>          The system will wait until the VM is down.
> [ ERROR ] Failed to execute stage 'Closing up': Error acquiring VM status
> [ INFO ] Stage: Clean up
> [ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150310140028.conf'
> [ INFO ] Stage: Pre-termination
> [ INFO ] Stage: Termination
>
> At that point, neither the ovirt-ha-broker nor the ovirt-ha-agent service was running.
>
> Note there was no significant pause after it said "The system will wait until the VM is down".
>
> After the script completed, I shut down the VM and manually started the HA services, and the VM came up. I could log in to the Administration Portal and finally see my HostedEngine VM. :-)
>
> I seem to be in a bad state, however: the Data Center has no storage domains attached. I'm not sure what else might need cleaning up. Any assistance appreciated.
>
> -Bob
>
>
>>> # ip addr
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     inet 127.0.0.1/8 scope host lo
>>>        valid_lft forever preferred_lft forever
>>>     inet6 ::1/128 scope host
>>>        valid_lft forever preferred_lft forever
>>> 2: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
>>>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>>>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 3: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
>>>     link/ether 56:56:f7:cf:73:27 brd ff:ff:ff:ff:ff:ff
>>> 4: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>>>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
>>> 6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
>>>     link/ether 22:a1:01:9e:30:71 brd ff:ff:ff:ff:ff:ff
>>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>>>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>>>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
>>>        valid_lft forever preferred_lft forever
>>>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>>>        valid_lft forever preferred_lft forever
>>>
>>> The only unusual thing about my setup that I can think of, from the network perspective, is that my physical host has a wireless interface, which I've not configured. Could it be confusing hosted-engine --deploy?
>>>
>>> -Bob
>>>
>>>
>
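(For the archives, the manual recovery described in the last quoted message amounts to roughly the following on the host. This is only a sketch, assuming the oVirt 3.5 hosted-engine CLI and the service names mentioned above; double-check the exact options against hosted-engine --help.)

hosted-engine --vm-shutdown        # shut down the engine VM left running by the setup script
systemctl start ovirt-ha-broker    # start the HA broker...
systemctl start ovirt-ha-agent     # ...then the HA agent, which relaunches the engine VM as a monitored service
hosted-engine --vm-status          # confirm the engine VM and agent report a good state

The remaining step, per the message at the top of this thread, was attaching a storage domain through the Administration Portal so the Data Center could come up.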
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users