Dear All,

I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While previous 
upgrades (4.1 to 4.2 and so on) went rather smoothly, this one was a different 
experience. A test upgrade on a 3-node setup went fine, so I went ahead and 
upgraded the 9-node production platform, unaware of the backward-compatibility 
issues between Gluster 3.12.15 and 5.3. After upgrading 2 nodes, the HA engine 
stopped and wouldn't start: VDSM wasn't able to mount the engine storage 
domain, since /dom_md/metadata was missing or couldn't be accessed.

I restored the file by taking a good copy from one of the underlying bricks, 
removing the 0-byte sticky-bit copies (and their corresponding gfids) from the 
other bricks, removing the file from the mount point, and copying the good 
copy back in via the mount point. After manually mounting the engine domain, 
manually recreating the corresponding symbolic links in /rhev/data-center and 
/var/run/vdsm/storage, and fixing the ownership back to vdsm.kvm (it had 
become root.root), I was able to start the HA engine again.
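
For anyone who runs into the same problem, the recovery looked roughly like 
the following. All paths, UUIDs and brick names are placeholders for my setup, 
so please treat it as a sketch rather than an exact recipe:

# on each brick that held a 0-byte sticky-bit copy, remove it together with
# its gfid hard link (the usual .glusterfs/<xx>/<yy>/<gfid> location)
rm /gluster/brick/engine/<domain-uuid>/dom_md/metadata
rm /gluster/brick/engine/.glusterfs/<xx>/<yy>/<gfid-of-metadata>

# remove the broken file via the fuse mount and copy the good one back in
rm /mnt/engine/<domain-uuid>/dom_md/metadata
cp /root/metadata.good /mnt/engine/<domain-uuid>/dom_md/metadata

# recreate the links vdsm expects (exact targets depend on your layout)
ln -s /rhev/data-center/mnt/glusterSD/<server>:_engine/<domain-uuid> \
      /rhev/data-center/<pool-uuid>/<domain-uuid>
ln -s /rhev/data-center/mnt/glusterSD/<server>:_engine/<domain-uuid>/images/<image-uuid> \
      /var/run/vdsm/storage/<domain-uuid>/<image-uuid>

# and put the ownership back
chown -R vdsm:kvm /rhev/data-center/mnt/glusterSD/<server>:_engine/<domain-uuid>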

Since the engine was up again but things still seemed rather unstable, and 
since I suspected an incompatibility between the Gluster versions, I decided 
to continue the upgrade on the other nodes; I thought it would be best to have 
them all on the same version rather soon. However, things went from bad to 
worse: the engine stopped again, and all VMs stopped working as well. So on a 
machine outside the setup I restored a backup of the engine taken from version 
4.2.8, just before the upgrade. With this engine I was at least able to start 
some VMs again and finalize the upgrade. Even once the upgrade was done, 
things didn't stabilize, and we also lost 2 VMs during the process due to 
image corruption.

After figuring out that Gluster 5.3 had quite a few issues, I was lucky to see 
that Gluster 5.5 was about to be released; the moment the RPMs were available, 
I installed them. This helped a lot in terms of stability, for which I'm very 
grateful! However, the performance is unfortunately terrible: it's about 15% 
of what it was running Gluster 3.12.15. It's strange, since a simple dd shows 
OK performance but our actual workload doesn't, while I would expect the 
performance to be better given all the improvements made since Gluster 3.12. 
Does anybody share the same experience?
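
For what it's worth, a plain dd is basically a buffered sequential write, 
which may be why it still looks fine, while a test with small synchronous 
writes is probably much closer to what the VMs actually do. Something along 
these lines (the mount point and sizes are placeholders) illustrates the kind 
of comparison I mean:

# buffered sequential write - this still looks ok for us
dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/<server>:_data/testfile bs=1M count=1024

# small synchronous random writes - likely closer to the actual VM workload
fio --name=vmlike --directory=/rhev/data-center/mnt/glusterSD/<server>:_data \
    --rw=randwrite --bs=4k --direct=1 --fsync=1 --size=512m --numjobs=4 --group_reporting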

I really hope Gluster 6 will soon be tested with oVirt and released, and that 
things start to perform and stabilize again... like in the good old days. Of 
course, if I can do anything to help, I'm happy to.

I think this is the short list of issues we have after the migration:

Gluster 5.5:
-       Poor performance for our workload (mostly write-dependent)
-       VMs randomly pause on unknown storage errors, which turn out to be 
stale file handles. Corresponding log entry: Lookup on shard 797 failed. Base 
file gfid = 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
-       Some files are listed twice in a directory (probably related to the 
stale file issue?). Example:
ls -la /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
total 3081
drwxr-x---.  2 vdsm kvm    4096 Mar 18 11:34 .
drwxr-xr-x. 13 vdsm kvm    4096 Mar 19 09:42 ..
-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw----.  1 vdsm kvm 1048576 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta

-       Brick processes sometimes start multiple times; sometimes I have 5 
brick processes for a single volume. Killing all glusterfsd processes for the 
volume on that machine and running gluster v start <vol> force usually results 
in just one being started, and from then on things look all right (see the 
sketch right after this list).
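
For reference, the cleanup when that happens looks roughly like this (the 
volume name is a placeholder):

# there should normally be exactly one glusterfsd per brick of the volume
pgrep -af 'glusterfsd.*<volname>'

# kill them all and force-start the volume, which brings up a single one again
pkill -f 'glusterfsd.*<volname>'
gluster volume start <volname> force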

oVirt 4.3.2.1-1.el7:
-       The ownership of all VM images changes to root.root after the VM is 
shut down, probably related to 
https://bugzilla.redhat.com/show_bug.cgi?id=1666795, but not only scoped to 
the HA engine. I'm still on compatibility level 4.2 for the cluster and the 
VMs, but upgraded to oVirt 4.3.2. (A rough workaround sketch follows this 
list.)
-       The network provider is set to OVN, which is fine... actually cool; 
only the ovs-vswitchd process is a CPU hog and sits at 100%.
-       It seems that on all nodes VDSM tries to get the stats for the HA 
engine, which is filling the logs with the following (not sure if this is new):
[api.virt] FINISH getStats return={'status': {'message': "Virtual machine does 
not exist: {'vmId': u'20d69acd-edfd-4aeb-a2ae-49e9c121b7e9'}", 'code': 1}} 
from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9 (api:54)
-       The message "[root] managedvolume not supported: Managed Volume Not 
Supported. Missing package os-brick.: ('Cannot import os_brick',) (caps:149)" 
fills the vdsm.log, but I also saw another message about this, so I suspect it 
will already be resolved shortly.
-       The machine I used to run the backup HA engine doesn't want to be 
removed from hosted-engine --vm-status, not even after running 
hosted-engine --clean-metadata --host-id=10 --force-clean or hosted-engine 
--clean-metadata --force-clean from the machine itself.
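
A possible stop-gap for the ownership issue above, which I haven't fully 
verified, is to spot the affected images and reset the owner before starting 
the VM again (the UUIDs in the paths are placeholders):

# find image files that have lost the vdsm ownership
find /rhev/data-center/<pool-uuid>/<domain-uuid>/images -not -user vdsm -ls

# reset ownership on an affected image directory before starting the VM
chown -R vdsm:kvm /rhev/data-center/<pool-uuid>/<domain-uuid>/images/<image-uuid>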

Think that's about it.

Don't get me wrong, I don't want to rant; I just wanted to share my experience 
and see where things can be made better.


Best,
Olaf