Hi Strahil,

Thank you for your reply. I found the issue: the "not connected" errors appear to originate in the ACL layer. Somehow it received a permission denied, and this was translated into a "not connected" error. While the file permissions were listed as owner=vdsm and group=kvm, the ACL layer apparently saw this differently. I ran "chown -R vdsm.kvm /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/" on the mount, and suddenly things started working again.
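In case it helps anyone else hitting this: what I did boils down to comparing the normal permission bits with the POSIX ACL view on the mount and then resetting ownership. Roughly (the mount path is the one from my setup, and the exact invocations are a sketch from memory rather than a verbatim transcript):

  ls -ln /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/              # permission bits showed uid 36 (vdsm) / gid 36 (kvm)
  getfacl -R /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/ | less   # check whether the ACL entries agree with that
  chown -R vdsm.kvm /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/   # reset ownership; this is what cleared the errors here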
I indeed have (or now had, since for the restore procedure I needed to provide an empty domain) one other VM on the HostedEngine domain; this VM hosted other critical services such as VPN. Since I see the HostedEngine domain as one of the most reliable domains, I used it for critical services. All other VMs have their own domains. I'm a bit surprised by your comment about brick multiplexing; I understood it should actually improve performance by sharing resources? Would you have some extra information about this?

To answer your questions: we currently have 15 physical hosts.
1) There are no pending heals.
2) Yes, I'm able to connect to the ports.
3) All peers report as connected.
4) Actually I had a setup like this before, with multiple smaller qcow disks in a RAID0 with LVM. But that did not appear to be reliable, so I switched to one single large disk. Would you know if there is some documentation about this? (I've put a rough sketch of how I read your pvmove suggestion at the bottom, below the quoted message.)
5) I'm running just about the latest and greatest stable: 4.3.7.2-1.el7. I only had trouble with the restore, because the cluster was still in compatibility mode 4.2 and there were 2 older VMs which had snapshots from prior versions, while the leaf was at compatibility level 4.2. Note: the backup was taken on the engine running 4.3.

Thanks Olaf

On Tue, Jan 28, 2020 at 17:31, Strahil Nikolov <hunter86...@yahoo.com> wrote: > On January 27, 2020 11:49:08 PM GMT+02:00, Olaf Buitelaar < > olaf.buitel...@gmail.com> wrote: > >Dear Gluster users, > > > >i'm a bit at a los here, and any help would be appreciated. > > > >I've lost a couple, since the disks suffered from severe XFS error's > >and of > >virtual machines and some won't boot because they can't resolve the > >size of > >the image as reported by vdsm: > >"VM kube-large-01 is down with error. Exit message: Unable to get > >volume > >size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume > >f16492a6-2d0e-4657-88e3-a9f4d8e48e74." > > > >which is also reported by the vdsm-client; vdsm-client Volume getSize > >storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3 > >storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829 > >imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd > >volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >vdsm-client: Command Volume.getSize with args {'storagepoolID': > >'59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID': > >'5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID': > >'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID': > >'2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed: > >(code=100, message=[Errno 107] Transport endpoint is not connected) > > > >with corresponding gluster mount log; > >[2020-01-27 19:42:22.678793] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-14: > >remote operation failed. Path: > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied] > >[2020-01-27 19:42:22.678828] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-13: > >remote operation failed. Path: > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied] > >[2020-01-27 19:42:22.679806] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-14: > >remote operation failed. 
Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.679862] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-13: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.679981] W [MSGID: 108027] > >[afr-common.c:2274:afr_attempt_readsubvol_set] > >0-ovirt-data-replicate-3: no > >read subvols for > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >[2020-01-27 19:42:22.680606] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-14: > >remote operation failed. Path: > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.680622] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-13: > >remote operation failed. Path: > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.681742] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-13: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.681871] W [MSGID: 108027] > >[afr-common.c:2274:afr_attempt_readsubvol_set] > >0-ovirt-data-replicate-3: no > >read subvols for > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >[2020-01-27 19:42:22.682344] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-14: > >remote operation failed. Path: > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >The message "W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-14: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2 > >times between [2020-01-27 19:42:22.679806] and [2020-01-27 > >19:42:22.683308] > >[2020-01-27 19:42:22.683327] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-data-client-13: > >remote operation failed. 
Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:42:22.683438] W [MSGID: 108027] > >[afr-common.c:2274:afr_attempt_readsubvol_set] > >0-ovirt-data-replicate-3: no > >read subvols for > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get] > >(-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b) > >[0x7faaaadeb92b] > >-->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78) > >[0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94) > >[0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds > >[Invalid > >argument] > >[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 176728: LOOKUP() > > >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >=> -1 (Transport endpoint is not connected) > > > >In addition to this, vdsm also reported it couldn't find the image of > >the > >HostedEngine, and refused to boot; > >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) > >[storage.TaskManager.Task] > >(Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error > >(task:875) > >Traceback (most recent call last): > >File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, > >in _run > > return fn(*args, **kargs) > > File "<string>", line 2, in prepareImage > >File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in > >method > > ret = func(*args, **kwargs) > >File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203, > >in prepareImage > > raise se.VolumeDoesNotExist(leafUUID) > >VolumeDoesNotExist: Volume does not exist: > >('38e4fba7-f140-4630-afab-0f744ebe3b57',) > > > >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm] > >(vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process > >failed > >(vm:933) > >Traceback (most recent call last): > > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in > >_startUnderlyingVm > > self._run() > > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in > >_run > > self._devices = self._make_devices() > > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in > >_make_devices > > disk_objs = self._perform_host_local_adjustment() > > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in > >_perform_host_local_adjustment > > self._preparePathsForDrives(disk_params) > > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in > >_preparePathsForDrives > > drive, self.id, path=path > > File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in > >prepareVolumePath > > raise vm.VolumeError(drive) > >VolumeError: Bad volume specification {'protocol': 'gluster', > >'address': > >{'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', > >'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64', > >'index': > >0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {}, > >'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64', > >'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk', > >'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0', > >'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000', > >'device': 'disk', 'path': > > 
>'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57', > >'propagateErrors': 'off', 'name': 'vda', 'volumeID': > >'38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias': > >'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name': > >'10.201.0.9', > >'port': '0'}], 'discard': False} > > > >And last, there is a storage domain which refuses to activate (from de > >vsdm.log); > >2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error > >checking path > >/rhev/data-center/mnt/glusterSD/10.201.0.11: > _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata > >(monitor:499) > >Traceback (most recent call last): > > File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line > >497, in _pathChecked > > delay = result.delay() > >File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line > >391, > >in delay > > raise exception.MiscFileReadException(self.path, self.rc, self.err) > >MiscFileReadException: Internal file read failure: > >(u'/rhev/data-center/mnt/glusterSD/10.201.0.11: > _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata', > >1, bytearray(b"/usr/bin/dd: failed to open > >\'/rhev/data-center/mnt/glusterSD/10.201.0.11: > _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\': > >Transport endpoint is not connected\n")) > > > >corresponding gluster mount log; > >The message "W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-0: > >remote operation failed. Path: > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 > >times between [2020-01-27 19:58:33.063826] and [2020-01-27 > >19:59:21.690134] > >The message "W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-1: > >remote operation failed. Path: > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 > >times between [2020-01-27 19:58:33.063734] and [2020-01-27 > >19:59:21.690150] > >The message "W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-0: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 > >times between [2020-01-27 19:58:33.065027] and [2020-01-27 > >19:59:21.691313] > >The message "W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-1: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 > >times between [2020-01-27 19:58:33.065106] and [2020-01-27 > >19:59:21.691328] > >The message "W [MSGID: 108027] > >[afr-common.c:2274:afr_attempt_readsubvol_set] > >0-ovirt-mon-2-replicate-0: > >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md" > >repeated > >4 times between [2020-01-27 19:58:33.065163] and [2020-01-27 > >19:59:21.691369] > >[2020-01-27 19:59:50.539315] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-0: > >remote operation failed. Path: > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:59:50.539321] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-1: > >remote operation failed. 
Path: > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:59:50.540412] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-1: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:59:50.540477] W [MSGID: 114031] > >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] > >0-ovirt-mon-2-client-0: > >remote operation failed. Path: (null) > >(00000000-0000-0000-0000-000000000000) [Permission denied] > >[2020-01-27 19:59:50.540533] W [MSGID: 108027] > >[afr-common.c:2274:afr_attempt_readsubvol_set] > >0-ovirt-mon-2-replicate-0: > >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 99: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md > >=> -1 (Transport endpoint is not connected) > >[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 105: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 112: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 118: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 125: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 131: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 137: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 144: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 150: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 156: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 163: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 169: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 176: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint > >is > >not connected) > >[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk] > >0-glusterfs-fuse: 183: LOOKUP() > >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => 
-1 (Transport endpoint > >is > >not connected) > > > >and some apps directly connecting to gluster mounts report these > >error's; > >2020-01-27 1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not > >found > >(Errcode: 107 "Transport endpoint is not connected") > >2020-01-27 3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not > >found (Errcode: 107 "Transport endpoint is not connected") > > > >So the errors seem to hint to either a connection issue or a quorum > >loss of > >some sort. However gluster is running on it's own private and separate > >network, with no firewall rules or anything else which could obstruct > >the > >connection. > >In addition gluster volume status reports all bricks and nodes are up, > >and > >gluster volume heal reports no pending heals. > >What makes this issue even more interesting is that when i manually > >check > >the files all seems fine; > > > >for the first issue, where the machine won't start because vdsm cannot > >determine the size. > >qemu is able to report the size; > >qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7: > > >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-46 > >57-88e3-a9f4d8e48e74 > >image: /rhev/data-center/mnt/glusterSD/10.201.0.7: > > >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >file format: raw > >virtual size: 34T (37580963840000 bytes) > >disk size: 7.1T > >in addition i'm able to mount the volume using a loop device; > >losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7: > > >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 > >kpartx -av /dev/loop0 > >vgscan > >vgchange -ay > >mount /dev/mapper/cl--data5-data5 /data5/ > >after this i'm able to see all contents of the disk, and in fact write > >to > >it. So the earlier reported connection error doesn't seem to apply > >here? > >This is actually how i'm currently running the VM, where i detached the > >disk, and mounted it in the VM via the loop device. The disk is a data > >disk for a heavily loaded mysql instance, and mysql is reporting no > >errors, > >and has been running for about a day now. > >Of course this not the way it should run, but it is at least working, > >only > >performance seems a bit off. So i would like to solve the issue and > >being > >able to attach the image as disk again. > > > >for the second issue where the Image of the HostedEngine couldn't be > >found, > >also all seems correct; > >The file is there and having the correct permissions; > > ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9 > > >\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/ > >total 49406333 > >drwxr-xr-x. 2 vdsm kvm 4096 Jan 25 12:03 . > >drwxr-xr-x. 13 vdsm kvm 4096 Jan 25 14:16 .. > >-rw-rw----. 1 vdsm kvm 62277025792 Jan 23 03:04 > >38e4fba7-f140-4630-afab-0f744ebe3b57 > >-rw-rw----. 1 vdsm kvm 1048576 Jan 25 21:48 > >38e4fba7-f140-4630-afab-0f744ebe3b57.lease > >-rw-r--r--. 1 vdsm kvm 285 Jan 27 2018 > >38e4fba7-f140-4630-afab-0f744ebe3b57.meta > >And i'm able to mount the image using a loop device and access it's > >contents. > >Unfortunate the VM wouldn't boot due to XFS error's. After tinkering > >with > >this for about a day to make it boot, i gave up and restored from a > >recent > >backup. 
But i took the data dir from postgress from the mounted old > >image > >to the new VM, and postgress was perfectly fine with it, also > >indicating > >the image wasn't a complete toast. > > > >And the last issue where the storage domain wouldn't activate. The file > >it > >claims it cannot read in the log is perfectly readable and writable; > >cat /rhev/data-center/mnt/glusterSD/10.201.0.11: > >_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata > >CLASS=Data > >DESCRIPTION=ovirt-mon-2 > >IOOPTIMEOUTSEC=10 > >LEASERETRIES=3 > >LEASETIMESEC=60 > >LOCKPOLICY= > >LOCKRENEWALINTERVALSEC=5 > >POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3 > >REMOTE_PATH=10.201.0.11:/ovirt-mon-2 > >ROLE=Regular > >SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458 > >TYPE=GLUSTERFS > >VERSION=4 > >_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807 > > > >So i've no clue where these "Transport endpoint is not connected" are > >coming from, and how to resolve them? > > > >I think there are 4 possible causes for this issue; > >1) I was trying to optimize the throughput of gluster on some volumes, > >since we recently gained some additional write load, which we had > >difficulty keeping up with. So I tried to incrementally > >add server.event-threads, via; > >gluster v set ovirt-data server.event-threads X > >since this didn't seem to improve the performance i changed it back to > >it's > >original values. But when i did that the VM's running on these volumes > >all > >locked-up, and required a reboot, which was by than still possible. > >Please > >note for the volumes ovirt-engine and ovirt-mon-2 this setting wasn't > >changed. > > > >2) I had a mix of running gluster 6.6 and 6.7, since i was in the > >middle of > >upgrading all to 6.7 > > > >3) On one of the physical brick nodes, after a reboot xfs errors were > >reported, and resolved by xfs_repair, which did remove some inodes in > >the > >process. For which i wasn't too worried since i would expect the > >gluster > >self healing daemon would resolve them, which seemed true for all > >volumes, > >except 1, where 1 gfid was pending for about 2 days. in this case also > >exactly the image which vdsm reports it cannot resolve the size from. > >But > >there are other vm image with the same issue, which i left out for > >brevity. > >However the pending heal of the single gfid resolved once I mounted the > >image via the loop device and started writing to. Which is probably due > >the > >nature on how gluster resolves what needs healing. Despite a gluster > >heal X > >full was issued before. > >I could also confirm the pending gfid was in fact missing on the brick > >node > >on the underlying brick directory, while the heal was still pending. > > > >4) I did some brick replace's (only the ovirt-data volume) but only of > >arbiter bricks of the affected volume in the first issue. 
> > > >the volume info's of the affected bricks look like this; > > > >Volume Name: ovirt-data > >Type: Distributed-Replicate > >Volume ID: 2775dc10-c197-446e-a73f-275853d38666 > >Status: Started > >Snapshot Count: 0 > >Number of Bricks: 4 x (2 + 1) = 12 > >Transport-type: tcp > >Bricks: > >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data > >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data > >Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter) > >Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data > >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data > >Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter) > >Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data > >Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data > >Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter) > >Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data > >Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data > >Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter) > >Options Reconfigured: > >cluster.choose-local: off > >server.outstanding-rpc-limit: 1024 > >storage.owner-gid: 36 > >storage.owner-uid: 36 > >transport.address-family: inet > >performance.readdir-ahead: on > >nfs.disable: on > >performance.quick-read: off > >performance.read-ahead: off > >performance.io-cache: off > >performance.stat-prefetch: off > >performance.low-prio-threads: 32 > >network.remote-dio: off > >cluster.eager-lock: enable > >cluster.quorum-type: auto > >cluster.server-quorum-type: server > >cluster.data-self-heal-algorithm: full > >cluster.locking-scheme: granular > >cluster.shd-max-threads: 8 > >cluster.shd-wait-qlength: 10000 > >features.shard: on > >user.cifs: off > >performance.write-behind-window-size: 512MB > >performance.cache-size: 384MB > >server.event-threads: 5 > >performance.strict-o-direct: on > >cluster.brick-multiplex: on > > > >Volume Name: ovirt-engine > >Type: Distributed-Replicate > >Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc > >Status: Started > >Snapshot Count: 0 > >Number of Bricks: 3 x 3 = 9 > >Transport-type: tcp > >Bricks: > >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine > >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine > >Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine > >Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine > >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine > >Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine > >Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine > >Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine > >Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine > >Options Reconfigured: > >performance.strict-o-direct: on > >performance.write-behind-window-size: 512MB > >features.shard-block-size: 64MB > >performance.cache-size: 128MB > >nfs.disable: on > >transport.address-family: inet > >performance.quick-read: off > >performance.read-ahead: off > >performance.io-cache: off > >performance.low-prio-threads: 32 > >network.remote-dio: enable > >cluster.eager-lock: enable > >cluster.quorum-type: auto > >cluster.server-quorum-type: server > >cluster.data-self-heal-algorithm: full > >cluster.locking-scheme: granular > >cluster.shd-max-threads: 8 > >cluster.shd-wait-qlength: 10000 > >features.shard: on > >user.cifs: off > >storage.owner-uid: 36 > >storage.owner-gid: 36 > >cluster.brick-multiplex: on > > > >Volume Name: ovirt-mon-2 > >Type: Replicate > >Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd > >Status: Started > >Snapshot Count: 0 > 
>Number of Bricks: 1 x (2 + 1) = 3 > >Transport-type: tcp > >Bricks: > >Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2 > >Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2 > >Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter) > >Options Reconfigured: > >performance.client-io-threads: on > >nfs.disable: on > >transport.address-family: inet > >performance.quick-read: off > >performance.read-ahead: off > >performance.io-cache: off > >performance.low-prio-threads: 32 > >network.remote-dio: off > >cluster.eager-lock: enable > >cluster.quorum-type: auto > >cluster.server-quorum-type: server > >cluster.data-self-heal-algorithm: full > >cluster.locking-scheme: granular > >cluster.shd-max-threads: 8 > >cluster.shd-wait-qlength: 10000 > >features.shard: on > >user.cifs: off > >cluster.choose-local: off > >client.event-threads: 4 > >server.event-threads: 4 > >storage.owner-uid: 36 > >storage.owner-gid: 36 > >performance.strict-o-direct: on > >performance.cache-size: 64MB > >performance.write-behind-window-size: 128MB > >features.shard-block-size: 64MB > >cluster.brick-multiplex: on > > > >Thanks Olaf > > Hi Olaf, > > Thanks for the detailed output. > On first glance I have noticed that you have a HostedEngine domain for > both ovirt's engine VM + for other VMs , is that right? > If yes, that's against best practices and not recommended. > Second, you use brick multiplexing, but according to RH documentation - > that feature is not supported for your workload - so in your case its > drawing attention but should not be a problem. > > Can you specify how many physical hosts do you have ? > > I will try to check the output deeper, but I think you need to check: > 1. Check gluster heal status - any pending heals should be resolved > 2. Use telnet/nc/ncat/netcat to verify that each host sees the peers' > brick ports. > 3. gluster volume heal <volume> info should report all bricks arr connected > gluster volume status must report all bricks have a pid > 4. OPTIONAL - Try to create smaller (it's not a good idea to have large > qcow2 disks) disks via oVirt and assign them to your mysql. Then try to > pvmove the LVs from the disk (mounted with loop) to the new disks - that > way you can get rid of the old qcow disk . > 5. What is your oVirt version ? Could it be an old 3.x ? > > Don't forget to backup :) > > Best Regards, > Strahil Nikolov >
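P.S. regarding your point 4: if I understand the idea correctly, it comes down to attaching a few new, smaller disks via oVirt and moving the LVM extents off the old loop-mounted image. A rough, untested sketch of how I read it (device and VG names below are only illustrations based on my current loop setup):

  pvcreate /dev/vdb /dev/vdc                # the newly attached smaller oVirt disks
  vgextend cl-data5 /dev/vdb /dev/vdc       # add them to the existing volume group
  pvmove /dev/mapper/loop0p1                # migrate all extents off the old loop-backed PV
  vgreduce cl-data5 /dev/mapper/loop0p1     # remove the old PV from the VG
  pvremove /dev/mapper/loop0p1              # and wipe its PV label

If that's more or less what you meant, I'll give it a try once the volumes behave again.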