I still haven't been able to resurrect the 1st host, so I've spent some time 
trying to get the hosted engine stable. I would welcome input on how to fix the 
problematic host so that it can be accessible again.

As per my original email, this all started when I tried to change the 
management vlan. I honestly cannot remember what I did (if anything) to the 
actual hosts at that point, but my troubleshooting today has consisted of 
adjusting the vlan settings and the /etc/sysconfig/network-scripts/ files on 
the problematic host to switch from the original vlan (1) to the new vlan (10).

In the meantime, I'm troubleshooting why the hosted engine isn't working 
properly, given that the other two hosts are up and reachable.

The hosted engine is "running" -- I can access and navigate around the oVirt 
Manager.
However, it appears that all of the storage domains are down, and all of the 
hosts are "NonOperational". I was, however, able to put two of the hosts into 
Maintenance Mode, including the problematic 1st host.

This is what I see on the 2nd host:

[root@cha2-storage network-scripts]# gluster peer status
Number of Peers: 2

Hostname: cha1-storage.mgt.example.com
Uuid: 348de1f3-5efe-4e0c-b58e-9cf48071e8e1
State: Peer in Cluster (Disconnected)

Hostname: cha3-storage.mgt.example.com
Uuid: 0563c3e8-237d-4409-a09a-ec51719b0da6
State: Peer in Cluster (Connected)
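
To keep an eye on whether the first host rejoins the cluster, the peer list above could be checked from a script. The following is only a minimal sketch: it assumes the `Hostname:` / `State:` layout shown in the output above, the function names are made up for illustration, and the sample text is abbreviated from that output.

```python
def parse_peer_status(text):
    """Return (hostname, state) pairs from `gluster peer status` output.

    Assumes each peer is printed as a "Hostname:" line followed later
    by a "State:" line, as in the output pasted above.
    """
    peers = []
    hostname = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Hostname:"):
            hostname = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and hostname is not None:
            peers.append((hostname, line.split(":", 1)[1].strip()))
            hostname = None
    return peers


def disconnected_peers(text):
    """Return hostnames whose state does not end in "(Connected)"."""
    return [host for host, state in parse_peer_status(text)
            if not state.endswith("(Connected)")]


# Sample abbreviated from the output above (Uuid lines omitted).
SAMPLE = """Number of Peers: 2

Hostname: cha1-storage.mgt.example.com
State: Peer in Cluster (Disconnected)

Hostname: cha3-storage.mgt.example.com
State: Peer in Cluster (Connected)
"""

if __name__ == "__main__":
    print(disconnected_peers(SAMPLE))
```

On a live host the text would come from running `gluster peer status` rather than from a hard-coded sample.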

[root@cha2-storage network-scripts]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. 
Please ensure that ovirt-ha-agent is running and the storage server is 
reachable.

[root@cha2-storage network-scripts]# hosted-engine --connect-storage
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/connect_storage_server.py", line 30, in <module>
    timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 312, in connect_storage_server
    sserver.connect_storage_server(timeout=timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 394, in connect_storage_server
    'Connection to storage server failed'
RuntimeError: Connection to storage server failed

The ovirt-ha-agent service seems to be continuously trying to start, but 
failing:
[root@cha2-storage network-scripts]# systemctl status -l ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2021-04-07 20:24:46 EDT; 60ms ago
  Process: 124306 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 124306 (code=exited, status=157)
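
The agent log in this message reports that it failed to start a monitor "via ovirt-ha-broker", so it may be worth looking at the broker service alongside the agent. The following is a minimal sketch rather than a confirmed diagnosis: it assumes both units are named `ovirt-ha-broker` and `ovirt-ha-agent` and simply records what `systemctl is-active` prints for each. The helper names are hypothetical.

```python
import subprocess

# Assumption: these are the two HA units of interest on the host.
HA_UNITS = ("ovirt-ha-broker", "ovirt-ha-agent")


def unit_state(unit, run=subprocess.run):
    """Return what `systemctl is-active <unit>` prints, or "unknown".

    `run` is injectable so the function can be exercised without systemd.
    """
    try:
        result = run(
            ["systemctl", "is-active", unit],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,  # Python 3.6 spelling of text=True
        )
    except OSError:
        # systemctl is not available on this machine.
        return "unknown"
    return result.stdout.strip() or "unknown"


def report(units=HA_UNITS, run=subprocess.run):
    """Map each unit name to its reported state."""
    return {unit: unit_state(unit, run) for unit in units}


if __name__ == "__main__":
    for unit, state in report().items():
        print(f"{unit}: {state}")
```

If the broker turns out not to be active, its own logs would be the next place to look. That step is not covered by this sketch.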

Some recent entries in /var/log/ovirt-hosted-engine-ha/agent.log:

MainThread::ERROR::2021-04-07 20:22:59,115::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2021-04-07 20:22:59,115::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2021-04-07 20:23:09,717::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started
MainThread::INFO::2021-04-07 20:23:09,742::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-04-07 20:23:09,837::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-04-07 20:23:09,838::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-04-07 20:23:09,839::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2021-04-07 20:23:09,842::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]

MainThread::ERROR::2021-04-07 20:23:09,842::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2021-04-07 20:23:09,842::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
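
Because the agent restarts every few seconds, the log fills up quickly, and it can help to pull out only the ERROR entries. The following is a minimal sketch that assumes the `::`-separated layout visible in the entries above (thread, level, timestamp, module, line number, logger, then `(function) message`). The `LogEntry` type and function names are made up for illustration, and traceback continuation lines are simply skipped.

```python
import re
from collections import namedtuple

LogEntry = namedtuple(
    "LogEntry",
    "thread level timestamp module lineno logger function message",
)

# The last field looks like "(function_name) free-form message".
_REST = re.compile(r"\((?P<function>[^)]*)\)\s*(?P<message>.*)", re.S)


def parse_entry(line):
    """Parse one agent.log line; return None for lines that do not fit."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None  # e.g. a traceback continuation line
    thread, level, timestamp, module, lineno, logger, rest = parts
    match = _REST.match(rest)
    if match is None or not lineno.isdigit():
        return None
    return LogEntry(thread, level, timestamp, module, int(lineno), logger,
                    match.group("function"), match.group("message").strip())


def errors(lines):
    """Return the parsed entries whose level is ERROR."""
    parsed = (parse_entry(line) for line in lines)
    return [entry for entry in parsed
            if entry is not None and entry.level == "ERROR"]


# Two entries copied from the log above, plus one traceback line.
SAMPLE = (
    "MainThread::INFO::2021-04-07 20:23:09,717::agent::67::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(run) "
    "ovirt-hosted-engine-ha agent 2.4.6 started\n"
    "MainThread::ERROR::2021-04-07 20:23:09,839::hosted_engine::564::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
    "(_initialize_broker) Failed to start necessary monitors\n"
    "FileNotFoundError: [Errno 2] No such file or directory\n"
)

if __name__ == "__main__":
    for entry in errors(SAMPLE.splitlines()):
        print(entry.timestamp, entry.function, entry.message)
```

To run it against the real file, pass the open file object for /var/log/ovirt-hosted-engine-ha/agent.log to `errors()` instead of the sample.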

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, April 7, 2021 5:36 PM, David White via Users <users@ovirt.org> 
wrote:

> I'm working on setting up my environment prior to production, and have run 
> into an issue.
>
> I got most things configured, but due to a limitation on one of my switches, 
> I decided to change the management vlan that the hosts communicate on. Over 
> the course of changing that vlan, I wound up resetting my router to default 
> settings.
>
> I have the router operational again, and I also have 1 of my switches 
> operational.
> Now, I'm trying to bring the oVirt cluster back online.
> This is oVirt 4.4 running on RHEL 8.3.
>
> The old vlan is 1, and the new vlan is 10.
>
> Currently, hosts 2 & 3 are accessible over the new vlan, and can ping each 
> other.
> I'm able to ssh to both hosts, and when I run "gluster peer status", I see 
> that they are connected to each other.
>
> However, host 1 is not accessible from anything. I can't ping it, and it 
> cannot get out.
>
> As part of my troubleshooting, I've done the following:
> From the host console, I ran `nmcli connection delete` to delete the old vlan 
> (VLAN 1).
> I moved the /etc/sysconfig/network-scripts/interface.1 file to interface.10, 
> and edited the file accordingly to make sure the vlan and device settings are 
> set to 10 instead of 1, and I rebooted the host.
>
> The engine seems to be running, but I don't understand why.
> From each of the hosts that are working (host 2 and host 3), I ran 
> "hosted-engine --check-liveliness" and both hosts indicate that the engine is 
> NOT running.
>
> Yet the engine loads in a web browser, and I'm able to log into 
> /ovirt-engine/webadmin/.
> The engine thinks that all 3 hosts are nonresponsive. See screenshot below:
>
> [Screenshot from 2021-04-07 17-33-48.png]
>
> What I'm really looking for help with is to get the first host back online.
> Once it is healthy and gluster is healthy, I feel confident I can get the 
> engine operational again.
>
> What else should I look for on this host?
>
> Sent with ProtonMail Secure Email.
