OK, this is even stranger. I ran the same dd test against the SSD OS/boot drives on my oVirt node hosts, which use the same model of drive (only smaller) and the same H310 controller; the only difference is that the OS/boot drives are in a RAID mirror while the gluster drives are passthrough. The test completes in under 2 seconds in /tmp on the host but takes ~45 seconds in /gluster_bricks/brick_whatever.
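For reference, the comparison is just the same command run in both places on the host, roughly like this (the brick path is one of mine, substitute whichever brick you're testing):

# ext4 /tmp on the RAID-1 OS/boot SSDs
dd if=/dev/zero of=/tmp/test4.img bs=512 count=5000 oflag=dsync

# XFS brick on the passthrough SSD behind the same H310
dd if=/dev/zero of=/gluster_bricks/brick_a/test4.img bs=512 count=5000 oflag=dsync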
Is there any explanation why there is such a vast difference between the two tests? example of one my mounts: /dev/mapper/onn_orchard1-tmp /tmp ext4 defaults,discard 1 2 /dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime 0 0 On Sun, Mar 8, 2020 at 12:23 PM Jayme <jay...@gmail.com> wrote: > Strahil, > > I'm starting to think that my problem could be related to the use of perc > H310 mini raid controllers in my oVirt hosts. The os/boot SSDs are raid > mirror but gluster storage is SSDs in passthrough. I've read that the queue > depth of h310 card is very low and can cause performance issues > especially when used with flash devices. > > dd if=/dev/zero of=test4.img bs=512 count=5000 oflag=dsync on one of my > hosts gluster bricks /gluster_bricks/brick_a for example takes 45 seconds > to complete. > > I can perform the same operation in ~2 seconds on another server with a > better raid controller, but with the same model ssd. > > I might look at seeing how I can swap out the h310's, unfortunately I > think that may require me to wipe the gluster storage drives as with > another controller I believe they'd need to be added as single raid 0 > arrays and would need to be rebuilt to do so. > > If I were to take one host down at a time is there a way that I can > re-build the entire server including wiping the gluster disks and add the > host back into the ovirt cluster and rebuild it along with the bricks? How > would you recommend doing such a task if I needed to wipe gluster disks on > each host ? > > > > On Sat, Mar 7, 2020 at 6:24 PM Jayme <jay...@gmail.com> wrote: > >> No worries at all about the length of the email, the details are highly >> appreciated. You've given me lots to look into and consider. >> >> >> >> On Sat, Mar 7, 2020 at 10:02 AM Strahil Nikolov <hunter86...@yahoo.com> >> wrote: >> >>> On March 7, 2020 1:12:58 PM GMT+02:00, Jayme <jay...@gmail.com> wrote: >>> >Thanks again for the info. You’re probably right about the testing >>> >method. >>> >Though the reason I’m down this path in the first place is because I’m >>> >seeing a problem in real world work loads. Many of my vms are used in >>> >development environments where working with small files is common such >>> >as >>> >npm installs working with large node_module folders, ci/cd doing lots >>> >of >>> >mixed operations io and compute. >>> > >>> >I started testing some of these things by comparing side to side with a >>> >vm >>> >using same specs only difference being gluster vs nfs storage. Nfs >>> >backed >>> >storage is performing about 3x better real world. >>> > >>> >Gluster version is stock that comes with 4.3.7. I haven’t attempted >>> >updating it outside of official ovirt updates. >>> > >>> >I’d like to see if I could improve it to handle my workloads better. I >>> >also >>> >understand that replication adds overhead. >>> > >>> >I do wonder how much difference in performance there would be with >>> >replica >>> >3 vs replica 3 arbiter. I’d assume arbiter setup would be faster but >>> >perhaps not by a considerable difference. >>> > >>> >I will check into c states as well >>> > >>> >On Sat, Mar 7, 2020 at 2:52 AM Strahil Nikolov <hunter86...@yahoo.com> >>> >wrote: >>> > >>> >> On March 7, 2020 1:09:37 AM GMT+02:00, Jayme <jay...@gmail.com> >>> >wrote: >>> >> >Strahil, >>> >> > >>> >> >Thanks for your suggestions. The config is pretty standard HCI setup >>> >> >with >>> >> >cockpit and hosts are oVirt node. XFS was handled by the deployment >>> >> >automatically. 
The gluster volumes were optimized for virt store. >>> >> > >>> >> >I tried noop on the SSDs, that made zero difference in the tests I >>> >was >>> >> >running above. I took a look at the random-io-profile and it looks >>> >like >>> >> >it >>> >> >really only sets vm.dirty_background_ratio = 2 & vm.dirty_ratio = 5 >>> >-- >>> >> >my >>> >> >hosts already appear to have those sysctl values, and by default are >>> >> >using virtual-host tuned profile. >>> >> > >>> >> >I'm curious what a test like "dd if=/dev/zero of=test2.img bs=512 >>> >> >count=1000 oflag=dsync" on one of your VMs would show for results? >>> >> > >>> >> >I haven't done much with gluster profiling but will take a look and >>> >see >>> >> >if >>> >> >I can make sense of it. Otherwise, the setup is pretty stock oVirt >>> >HCI >>> >> >deployment with SSD backed storage and 10Gbe storage network. I'm >>> >not >>> >> >coming anywhere close to maxing network throughput. >>> >> > >>> >> >The NFS export I was testing was an export from a local server >>> >> >exporting a >>> >> >single SSD (same type as in the oVirt hosts). >>> >> > >>> >> >I might end up switching storage to NFS and ditching gluster if >>> >> >performance >>> >> >is really this much better... >>> >> > >>> >> > >>> >> >On Fri, Mar 6, 2020 at 5:06 PM Strahil Nikolov >>> ><hunter86...@yahoo.com> >>> >> >wrote: >>> >> > >>> >> >> On March 6, 2020 6:02:03 PM GMT+02:00, Jayme <jay...@gmail.com> >>> >> >wrote: >>> >> >> >I have 3 server HCI with Gluster replica 3 storage (10GBe and SSD >>> >> >> >disks). >>> >> >> >Small file performance inner-vm is pretty terrible compared to a >>> >> >> >similar >>> >> >> >spec'ed VM using NFS mount (10GBe network, SSD disk) >>> >> >> > >>> >> >> >VM with gluster storage: >>> >> >> > >>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync >>> >> >> >1000+0 records in >>> >> >> >1000+0 records out >>> >> >> >512000 bytes (512 kB) copied, 53.9616 s, 9.5 kB/s >>> >> >> > >>> >> >> >VM with NFS: >>> >> >> > >>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync >>> >> >> >1000+0 records in >>> >> >> >1000+0 records out >>> >> >> >512000 bytes (512 kB) copied, 2.20059 s, 233 kB/s >>> >> >> > >>> >> >> >This is a very big difference, 2 seconds to copy 1000 files on >>> >NFS >>> >> >VM >>> >> >> >VS 53 >>> >> >> >seconds on the other. >>> >> >> > >>> >> >> >Aside from enabling libgfapi is there anything I can tune on the >>> >> >> >gluster or >>> >> >> >VM side to improve small file performance? I have seen some >>> >guides >>> >> >by >>> >> >> >Redhat in regards to small file performance but I'm not sure >>> >what/if >>> >> >> >any of >>> >> >> >it applies to oVirt's implementation of gluster in HCI. >>> >> >> >>> >> >> You can use the rhgs-random-io tuned profile from >>> >> >> >>> >> > >>> >> >>> > >>> ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.4.2.0-1.el7rhgs.src.rpm >>> >> >> and try with that on your hosts. >>> >> >> In my case, I have modified it so it's a mixture between >>> >> >rhgs-random-io >>> >> >> and the profile for Virtualization Host. >>> >> >> >>> >> >> Also,ensure that your bricks are using XFS with relatime/noatime >>> >> >mount >>> >> >> option and your scheduler for the SSDs is either 'noop' or 'none' >>> >> >.The >>> >> >> default I/O scheduler for RHEL7 is deadline which is giving >>> >> >preference to >>> >> >> reads and your workload is definitely 'write'. 
>>> >> >> >>> >> >> Ensure that the virt settings are enabled for your gluster >>> >volumes: >>> >> >> 'gluster volume set <volname> group virt' >>> >> >> >>> >> >> Also, are you running on fully allocated disks for the VM or you >>> >> >started >>> >> >> thin ? >>> >> >> I'm asking as creation of new shards at gluster level is a slow >>> >> >task. >>> >> >> >>> >> >> Have you checked gluster profiling the volume? It can clarify >>> >what >>> >> >is >>> >> >> going on. >>> >> >> >>> >> >> >>> >> >> Also are you comparing apples to apples ? >>> >> >> For example, 1 ssd mounted and exported as NFS and a replica 3 >>> >> >volume >>> >> >> of the same type of ssd ? If not, the NFS can have more iops due >>> >to >>> >> >> multiple disks behind it, while Gluster has to write the same >>> >thing >>> >> >on all >>> >> >> nodes. >>> >> >> >>> >> >> Best Regards, >>> >> >> Strahil Nikolov >>> >> >> >>> >> >> >>> >> >>> >> Hi Jayme, >>> >> >>> >> >>> >> My test are not quite good ,as I have a different setup: >>> >> >>> >> NVME - VDO - 4 thin LVs -XFS - 4 Gluster volumes (replica 2 arbiter >>> >1) >>> >> - 4 storage domains - striped LV in each VM >>> >> >>> >> RHEL7 VM (fully stock): >>> >> [root@node1 ~]# dd if=/dev/zero of=test2.img bs=512 count=1000 >>> >oflag=dsync >>> >> 1000+0 records in >>> >> 1000+0 records out >>> >> 512000 bytes (512 kB) copied, 19.8195 s, 25.8 kB/s >>> >> [root@node1 ~]# >>> >> >>> >> Brick: >>> >> [root@ovirt1 data_fast]# dd if=/dev/zero of=test2.img bs=512 >>> >count=1000 >>> >> oflag=dsync >>> >> 1000+0 records in >>> >> 1000+0 records out >>> >> 512000 bytes (512 kB) copied, 1.41192 s, 363 kB/s >>> >> >>> >> As I use VDO with compression (on 1/4 of the NVMe) - I cannot expect >>> >any >>> >> performance from it. >>> >> >>> >> >>> >> Is your app really using dsync ? I have seen many times that >>> >performance >>> >> testing with the wrong tools/tests cause more trouble than it >>> >should. >>> >> >>> >> I would recommend you to test with a real workload before deciding to >>> >> change the architecture. >>> >> >>> >> I forgot to mention that you need to disable c states for your >>> >systems if >>> >> you are chasing performance. >>> >> Run a gluster profile while you run real workload in your VMs and >>> >then >>> >> provide that for analysis. >>> >> >>> >> Which version of Gluster are you using ? >>> >> >>> >> Best Regards, >>> >> Strahil Nikolov >>> >> >>> >>> Hm... >>> Then you do have a real workload scenario - pick one of the most often >>> used tasks and use it's time of completion for reference. >>> Synthetic benchmarking is not good. >>> >>> As far as I know oVirt is actually running on gluster v6.X . >>> @Sandro, >>> Can you hint us the highest supported gluster version on oVirt ? I'm >>> running v7.0, so I'm little bit off the track. >>> >>> Jayme, >>> >>> Next steps are to check: >>> 1. Did you disable cstates - there are very good articles for >>> RHEL/CentOS 7 >>> 2. Check firmware of your HCI nodes - I've seen numerous network/SAN >>> issues due to old firmware including stucked processes >>> 3. Check the articles for RHV and hugepages . If your VMs are memory >>> dynamic and lots of RAM is needed -> hugepages will bring more performance. >>> Second , transparent huge pages must be disabled. >>> 4. Create a High Performance VM for testing purposes with fully >>> allocated disks >>> 5. Check if 'noatime' or 'relatime' is set for the bricks. 
If selinux >>> is in enforcing mode (I highly recommend that), you can use mount option >>> 'system_u:object_r:glusterd_brick_t:s0' which will cause the kernel to >>> reduce lookups to check the SELINUX context of all files in the brick - >>> and increasing the performance. >>> >>> 6. Consider switching to 'noop'/'none' or tuning 'deadline' I/O >>> scheduler to match your needs >>> >>> 7. Create a gluster profile during the VM(step 4) is being tested , >>> as if is needed. >>> >>> 8. Consider using 'Pass-through host cpu' which is enabled in UI via >>> -> VM-> edit -> Host -> Start on specific host -> select all hosts with >>> the same cpu -> allow manual and automatic migration -> OK >>> This mode allows all instructions on the Host CPU to be available on >>> the guest, greatly increasing performance for a lot of software. >>> >>> >>> The difference between 'replica 3' and 'replica 3 arbiter 1' (old name >>> was 'replica 2 arbiter 1' but it means the same) is the fact that the >>> arbitrated volume requiress less bandwidth (due to the fact that the >>> files on the arbiter has 0 bytes of data) and stores only metadata to >>> prevent splitbrain. >>> Drawbacks of the arbiter is that you have only 2 sources to read from, >>> while replica 3 provides three sources to read from. >>> With glusterd 2.0 ( I think it was introduced in gluster v7 ) the >>> arbiter doesn't need to be locally (which means higher lattencies are no >>> longer an issue), and is only needed when one of data bricks is >>> needed.Still, the remote arbiter is too new for prod. >>> >>> Next: You can consider clusterized 2-node NFS Ganesha (with quorum >>> device for the third vote) as an NFS source. The good thing about NFS >>> Ganes is the primary focus from the Gluster community and it uses >>> libgfapi to connect to the backend (replica volume). >>> >>> I think it's enough for now , but I guess other stuff could come to >>> my mind at later stage. >>> >>> Edit: This e-mail is way longer than I initially thought to be.Sorry >>> about that. >>> >>> >>> Best Regards, >>> Strahil Nikolov >>> >>
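As a rough next step on the H310 queue-depth theory (and the profiling suggested above), something like the following should show the scheduler and queue depth of the disk behind a brick, and capture a gluster profile while a real workload runs. The device and volume names here are placeholders; substitute the actual brick device and volume name:

# I/O scheduler and queue depth of the passthrough disk backing the brick (sda is a placeholder)
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/device/queue_depth

# capture a profile while a real workload runs in the VMs, then stop it
gluster volume profile <volname> start
gluster volume profile <volname> info
gluster volume profile <volname> stop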