Questions/comments inline.

On Thu, Mar 28, 2019 at 10:18 PM <olaf.buitel...@gmail.com> wrote:
> Dear All,
>
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
> previous upgrades, e.g. from 4.1 to 4.2, went rather smoothly, this one
> was a different experience. A test upgrade on a 3-node setup went fine,
> so I headed to upgrade the 9-node production platform, unaware of the
> backward-compatibility issues between gluster 3.12.15 and 5.3. After
> upgrading 2 nodes, the HA engine stopped and wouldn't start: vdsm wasn't
> able to mount the engine storage domain, since /dom_md/metadata was
> missing or couldn't be accessed. I restored this file by taking a good
> copy from the underlying bricks, removing the file (and the
> corresponding gfid links) from the bricks where it was 0 bytes and
> marked with the sticky bit, removing the file from the mount point, and
> copying the good copy back onto the mount point. After manually mounting
> the engine domain, manually recreating the corresponding symbolic links
> in /rhev/data-center and /var/run/vdsm/storage, and fixing the ownership
> back to vdsm.kvm (it had become root.root), I was able to start the HA
> engine again.
>
> Since the engine was up again but things seemed rather unstable, I
> decided to continue the upgrade on the other nodes: suspecting an
> incompatibility between gluster versions, I thought it would be best to
> have them all on the same version rather soon. However, things went from
> bad to worse; the engine stopped again, and all VMs stopped working as
> well. So on a machine outside the setup I restored a backup of the
> engine taken from version 4.2.8 just before the upgrade. With this
> engine I was at least able to start some VMs again and finalize the
> upgrade. Once upgraded, things didn't stabilize, and I also lost 2 VMs
> during the process due to image corruption. After figuring out that
> gluster 5.3 had quite some issues, I was lucky to see that gluster 5.5
> was about to be released; the moment the RPMs were available, I
> installed them. This helped a lot in terms of stability, for which I'm
> very grateful! However, the performance is unfortunately terrible: it's
> about 15% of what it was running gluster 3.12.15. It's strange, since a
> simple dd shows OK performance but our actual workload doesn't, while I
> would expect the performance to be better given all the improvements
> made since gluster 3.12. Does anybody share the same experience?
>
> I really hope gluster 6 will soon be tested with oVirt and released, and
> things start to perform and stabilize again, like the good old days. Of
> course, if I can do anything, I'm happy to help.
>
> I think the following is a short list of the issues we have after the
> migration.
>
> Gluster 5.5:
>
> - Poor performance for our workload (mostly write-dependent)

For this, could you share the volume-profile output specifically for the
affected volume(s)? Here's what you need to do:

1. # gluster volume profile $VOLNAME stop   (ends any previous profiling session)
2. # gluster volume profile $VOLNAME start
3. Run the test inside the VM wherein you see bad performance.
4. # gluster volume profile $VOLNAME info   (save the output of this command into a file)
5. # gluster volume profile $VOLNAME stop
6. Attach the output file saved in step 4.
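If it's easier, the whole capture can be wrapped in a small script. A
minimal sketch, assuming bash on one of the gluster nodes and that the
volume name is passed as an argument (the script name and output path are
just examples):

    #!/bin/bash
    # capture-profile.sh -- collect a gluster volume profile around a test run
    VOLNAME=${1:?usage: $0 <volname>}
    OUT="/tmp/${VOLNAME}-profile-$(date +%s).txt"

    gluster volume profile "$VOLNAME" stop 2>/dev/null   # end any stale session
    gluster volume profile "$VOLNAME" start
    read -r -p "Run the workload inside the VM now, then press Enter... "
    gluster volume profile "$VOLNAME" info > "$OUT"      # per-brick FOP counts and latencies
    gluster volume profile "$VOLNAME" stop
    echo "Profile written to $OUT"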
> - VMs randomly pause on unknown storage errors, which are "stale file
> handle" errors. Corresponding log:
>
> Lookup on shard 797 failed. Base file gfid =
> 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]

Could you share the complete gluster client log file? It would be a file
name matching the pattern rhev-data-center-mnt-glusterSD-*. Also the
output of `gluster volume info $VOLNAME`. (See also the sketch further
below for inspecting the affected shard on the bricks.)

> - Some files are listed twice in a directory (probably related to the
> stale file issue?). Example:
>
> ls -la /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
> total 3081
> drwxr-x---.  2 vdsm kvm    4096 Mar 18 11:34 .
> drwxr-xr-x. 13 vdsm kvm    4096 Mar 19 09:42 ..
> -rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw----.  1 vdsm kvm 1048576 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
> -rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
> -rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta

Adding DHT and readdir-ahead maintainers regarding entries getting listed
twice.

@Nithya Balachandran <nbala...@redhat.com> ^^
@Gowdappa, Raghavendra <rgowd...@redhat.com> ^^
@Poornima Gurusiddaiah <pguru...@redhat.com> ^^
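In the meantime, it would also help to know whether the two identical
entries resolve to the same inode; `ls -lai` on the directory from the
listing above shows the inode number per entry (the same inode on both
lines would point at a readdir-level duplication rather than two distinct
files):

    # -i prepends the inode number to each entry
    ls -lai /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/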
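And regarding the "Lookup on shard 797 failed" error above: shards are
stored under the hidden .shard directory at the root of each brick, named
<base-file-gfid>.<shard-number>, so the shard's state can be compared
across bricks. A minimal sketch, where /gluster/brick1/engine is a
placeholder for your actual brick path:

    # run on each node hosting a brick of the volume
    ls -l /gluster/brick1/engine/.shard/8a27b91a-ff02-42dc-bd4c-caa019424de8.797
    # dump the shard's extended attributes (gfid, replication changelogs) in hex
    getfattr -d -m . -e hex /gluster/brick1/engine/.shard/8a27b91a-ff02-42dc-bd4c-caa019424de8.797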
> - Brick processes sometimes start multiple times; sometimes I have 5
> brick processes for a single volume. Killing all glusterfsd's for the
> volume on the machine and running `gluster v start <vol> force` usually
> starts just one after the event, and from then on things look all right.

Did you mean 5 brick processes for a single brick directory?

+Mohit Agrawal <moagr...@redhat.com> ^^
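To confirm this, you could list the running glusterfsd processes and their
--brick-name arguments and compare against what glusterd reports. A quick
sketch, with "engine" standing in for your volume name:

    # repeated --brick-name values for the same path indicate duplicates
    pgrep -af glusterfsd
    # what glusterd believes is running (one PID and port per brick)
    gluster volume status engine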
-Krutika

> oVirt 4.3.2.1-1.el7:
>
> - The ownership of all VM images changes to root.root after the VM is
> shut down, probably related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1666795, but not only scoped
> to the HA engine. I'm still in compatibility mode 4.2 for the cluster
> and for the VMs, but upgraded to oVirt 4.3.2.
> - The network provider is set to OVN, which is fine, actually cool; only
> ovs-vswitchd is a CPU hog and utilizes 100%.
> - It seems that on all nodes vdsm tries to get the stats for the HA
> engine, which is filling the logs with (not sure if this is new):
>
> [api.virt] FINISH getStats return={'status': {'message': "Virtual
> machine does not exist: {'vmId': u'20d69acd-edfd-4aeb-a2ae-49e9c121b7e9'}",
> 'code': 1}} from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9
> (api:54)
>
> - It seems the os_brick package is missing:
>
> [root] managedvolume not supported: Managed Volume Not Supported.
> Missing package os-brick.: ('Cannot import os_brick',) (caps:149)
>
> This fills the vdsm.log, but for this I also saw another message, so I
> suspect it will be resolved shortly.
> - The machine I used to run the backup HA engine doesn't want to be
> removed from hosted-engine --vm-status, not even after running
> hosted-engine --clean-metadata --host-id=10 --force-clean or
> hosted-engine --clean-metadata --force-clean from the machine itself.
>
> Think that's about it.
>
> Don't get me wrong, I don't want to rant; I just wanted to share my
> experience and see where things can be made better.
>
> Best Olaf

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/DGMO3Y5NWTLMLRSSLYWYNPRTKKN7XHHG/