Nah... this is not going to fix your issue and is unnecessary. Just compare the 
data from all bricks ... most probably the 'Last Updated' time differs and the 
gfid of the file differs. Find the brick that has the freshest data, and replace 
the file on the other bricks with that last good copy (move the stale copies 
away as a backup and rsync the good one in). You can also run a 'full heal'.
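
Something along these lines (a rough sketch only -- the file path and node name 
are placeholders, adjust them to the entry reported by heal info; the engine 
volume is just used as an example):

# on every node, compare the copy sitting on the brick itself
stat /gluster_bricks/engine/engine/<path-to-file>
getfattr -d -m . -e hex /gluster_bricks/engine/engine/<path-to-file>
# pick the brick with the newest mtime / sane gfid, back up the stale copies
mv /gluster_bricks/engine/engine/<path-to-file> /root/<file>.bak
# copy the good one over from the freshest node, preserving attributes
rsync -av <good-node>:/gluster_bricks/engine/engine/<path-to-file> \
    /gluster_bricks/engine/engine/<path-to-file>
# and finally trigger a full heal and re-check
gluster volume heal engine full
gluster volume heal engine info
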
Best Regards,
Strahil Nikolov
    On Saturday, December 14, 2019 at 21:18:44 GMT+2, Jayme 
<jay...@gmail.com> wrote:

 *Update* 
Situation has improved. All VMs and the engine are running. I'm left right now 
with about two heal entries in each glusterfs storage volume that will not heal. 
In all cases each heal entry is related to an OVF_STORE image, and the problem 
appears to be an issue with the gluster metadata for those OVF_STORE images. 
When I look at the files shown in the gluster volume heal info output, I see 
question marks on the .meta files, which indicates an attribute/gluster problem 
(even though there is no split-brain), and I get an input/output error when 
attempting to do anything with the files.
If I look at the files on each host in /gluster_bricks they all look fine. I 
only see question marks on the .meta files when I look at them through the 
/rhev mounts.
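
A check along these lines on each brick host (just a sketch, using one of the 
OVF_STORE .meta paths from the heal info output further down) should show 
whether the gfid or AFR xattrs actually differ between the bricks:

getfattr -d -m . -e hex \
    /gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
stat /gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
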
Does anyone know how I can correct the attributes on these OVF_STORE files?  
I've tried putting each host in maintenance and re-activating to re-mount 
gluster volumes.  I've also stopped and started all gluster volumes.  
I'm thinking I might be able to solve this by shutting down all VMs, placing 
all hosts in maintenance, and safely restarting the entire cluster... but that 
may not be necessary?  
On Fri, Dec 13, 2019 at 12:59 AM Jayme <jay...@gmail.com> wrote:

I believe I was able to get past this by stopping the engine volume, then 
unmounting the glusterfs engine mount on all hosts and re-starting the volume. 
I was able to start the hosted engine on host0.
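
For reference, roughly the steps involved (a sketch; the volume name and mount 
path are the ones shown in my earlier messages below):

gluster volume stop engine
# on each host, unmount the fuse mount of the engine storage domain
umount /rhev/data-center/mnt/glusterSD/orchard0:_engine
gluster volume start engine
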
I'm still facing a few problems:
1. I'm still seeing this issue in each host's logs:
Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent 
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed 
scanning for OVF_STORE due to Command Volume.getInfo with args 
{'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 
'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID': 
u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID': 
u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume 
does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent 
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable 
to identify the OVF_STORE volume, falling back to initial vm.conf. Please 
ensure you already added your first data domain for regular VMs


2. Most of my gluster volumes still have un-healed entries which I can't seem 
to heal. I'm not sure what the answer is here.
On Fri, Dec 13, 2019 at 12:33 AM Jayme <jay...@gmail.com> wrote:

I was able to get the hosted engine started manually via virsh after 
re-creating a missing symlink in /var/run/vdsm/storage -- I later shut it down 
and am still having the same problem with the HA broker starting. It appears 
that the problem *might* be a corrupt HA metadata file, although gluster is not 
reporting split-brain on the engine volume.
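
(To double-check the split-brain part, something like this should list any 
split-brain entries on the engine volume:)

gluster volume heal engine info split-brain
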
I'm seeing this:
ls -al 
/rhev/data-center/mnt/glusterSD/orchard0\:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
ls: cannot access 
/rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/hosted-engine.metadata:
 Input/output error
total 0
drwxr-xr-x. 2 vdsm kvm  67 Dec 13 00:30 .
drwxr-xr-x. 6 vdsm kvm  64 Aug  6  2018 ..
lrwxrwxrwx. 1 vdsm kvm 132 Dec 13 00:30 hosted-engine.lockspace -> 
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
l?????????? ? ?    ?     ?            ? hosted-engine.metadata

Clearly showing some sort of issue with hosted-engine.metadata on the client 
mount.  
On each node in /gluster_bricks I see this:
# ls -al 
/gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
total 0
drwxr-xr-x. 2 vdsm kvm  67 Dec 13 00:31 .
drwxr-xr-x. 6 vdsm kvm  64 Aug  6  2018 ..
lrwxrwxrwx. 2 vdsm kvm 132 Dec 13 00:31 hosted-engine.lockspace -> 
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
lrwxrwxrwx. 2 vdsm kvm 132 Dec 12 16:30 hosted-engine.metadata -> 
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c

 ls -al 
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
-rw-rw----. 1 vdsm kvm 1073741824 Dec 12 16:48 
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c


I'm not sure how to proceed at this point.  Do I have data corruption, a 
gluster split-brain issue or something else?  Maybe I just need to re-generate 
metadata for the hosted engine?
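
If regenerating the metadata turns out to be the answer, my understanding is it 
would be something along these lines (a sketch only -- I haven't run this, the 
host id is a placeholder, and the exact flags should be verified against 
hosted-engine --help first):

# on the affected host, with the HA services stopped
systemctl stop ovirt-ha-agent ovirt-ha-broker
hosted-engine --clean-metadata --host-id=<id> --force-clean
systemctl start ovirt-ha-broker ovirt-ha-agent
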
On Thu, Dec 12, 2019 at 6:36 PM Jayme <jay...@gmail.com> wrote:

I'm running a three-server HCI cluster, up and running on 4.3.7 with no 
problems. Today I updated to 4.3.8. The engine upgraded fine and was rebooted. 
The first host updated fine; I rebooted it and let all gluster volumes heal. I 
put the second host in maintenance, upgraded it successfully, and rebooted it. 
I waited for the gluster volumes to heal for over an hour, but the heal process 
was not completing. I tried restarting the gluster services as well as the 
host, with no success.
I'm in a state right now where there are pending heals on almost all of my 
volumes. Nothing is reporting split-brain, but the heals are not completing. 
All VMs are still currently running except the hosted engine. The hosted engine 
was running, but on the second host I upgraded I was seeing errors such as:
Dec 12 16:34:39 orchard2 journal: ovirt-ha-agent 
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed 
scanning for OVF_STORE due to Command Volume.getInfo with args 
{'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 
'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID': 
u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID': 
u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume 
does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))

I shut down the engine VM and attempted a manual heal on the engine volume.  I 
cannot start the engine on any host now.  I get:
The hosted engine configuration has not been retrieved from shared storage. 
Please ensure that ovirt-ha-agent is running and the storage server is 
reachable.
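
(The obvious things to check here are the HA service status and the engine VM 
status, e.g.:)

systemctl status ovirt-ha-broker ovirt-ha-agent
hosted-engine --vm-status
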

I'm seeing ovirt-ha-broker crashing on all three nodes:
Dec 12 18:30:48 orchard0 python: detected unhandled Python exception in 
'/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Dec 12 18:30:48 orchard0 abrt-server: Duplicate: core backtrace
Dec 12 18:30:48 orchard0 abrt-server: DUP_OF_DIR: 
/var/tmp/abrt/Python-2019-03-14-21:02:52-44318
Dec 12 18:30:48 orchard0 abrt-server: Deleting problem directory 
Python-2019-12-12-18:30:48-23193 (dup of Python-2019-03-14-21:02:52-44318)
Dec 12 18:30:49 orchard0 vdsm[6087]: ERROR failed to retrieve Hosted Engine HA 
score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service: main process exited, 
code=exited, status=1/FAILURE
Dec 12 18:30:49 orchard0 systemd: Unit ovirt-ha-broker.service entered failed 
state.
Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service failed.
Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service holdoff time over, 
scheduling restart.
Dec 12 18:30:49 orchard0 systemd: Cannot add dependency job for unit 
lvm2-lvmetad.socket, ignoring: Unit is masked.
Dec 12 18:30:49 orchard0 systemd: Stopped oVirt Hosted Engine High Availability 
Communications Broker.


Here is what gluster volume heal info looks like for the engine volume; it's 
similar on the other volumes as well (although some of those have more heals 
pending):
 gluster volume heal engine info
Brick gluster0:/gluster_bricks/engine/engine
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
/d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
Status: Connected
Number of entries: 4

Brick gluster1:/gluster_bricks/engine/engine
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
Status: Connected
Number of entries: 4

Brick gluster2:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

I don't see much in vdsm.log, and the gluster logs look fairly normal to me -- 
I'm not seeing any obvious errors in them. As far as I can tell the underlying 
storage is fine. Why are my gluster volumes not healing, and why is the 
self-hosted engine failing to start?
More agent and broker logs:
==> agent.log <==
MainThread::ERROR::2019-12-12 
18:36:09,056::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
 Failed to start necessary monitors
MainThread::ERROR::2019-12-12 
18:36:09,058::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Traceback (most recent call last):
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
131, in _run_agent
    return action(he)
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
55, in action_proper
    return he.start_monitoring()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 432, in start_monitoring
    self._initialize_broker()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 556, in _initialize_broker
    m.get('options', {}))
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", 
line 89, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 
2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': 
None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]

MainThread::ERROR::2019-12-12 
18:36:09,058::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Trying to restart agent
MainThread::ERROR::2019-12-12 
18:36:19,619::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
 Failed to start necessary monitors
MainThread::ERROR::2019-12-12 
18:36:19,619::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Traceback (most recent call last):
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
131, in _run_agent
    return action(he)
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
55, in action_proper
    return he.start_monitoring()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 432, in start_monitoring
    self._initialize_broker()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 556, in _initialize_broker
    m.get('options', {}))
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", 
line 89, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 
2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': 
None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]

MainThread::ERROR::2019-12-12 
18:36:19,619::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Trying to restart agent
MainThread::ERROR::2019-12-12 
18:36:30,568::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
 Failed to start necessary monitors
MainThread::ERROR::2019-12-12 
18:36:30,570::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Traceback (most recent call last):
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
131, in _run_agent
    return action(he)
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
55, in action_proper
    return he.start_monitoring()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 432, in start_monitoring
    self._initialize_broker()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 556, in _initialize_broker
    m.get('options', {}))
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", 
line 89, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 
2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': 
None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]

MainThread::ERROR::2019-12-12 
18:36:30,570::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Trying to restart agent
MainThread::ERROR::2019-12-12 
18:36:41,581::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
 Failed to start necessary monitors
MainThread::ERROR::2019-12-12 
18:36:41,583::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Traceback (most recent call last):
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
131, in _run_agent
    return action(he)
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 
55, in action_proper
    return he.start_monitoring()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 432, in start_monitoring
    self._initialize_broker()
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
 line 556, in _initialize_broker
    m.get('options', {}))
  File 
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", 
line 89, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 
2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': 
None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]

MainThread::ERROR::2019-12-12 
18:36:41,583::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
 Trying to restart agent





_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/U5YFDWCQJYNALSVNPZG4FLUO7KB2Z2XI/
  