OK, this is even stranger. The same dd test against my SSD OS/boot drives
on the oVirt node hosts, using the same model drive (only smaller) and the
same H310 controller (the only difference being that the OS/boot drives are
in a RAID mirror while the gluster drives are passthrough), completes in <2
seconds in /tmp on the host but takes ~45 seconds in
/gluster_bricks/brick_whatever.
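
For reference, the exact test I'm running in both locations is just the
following (the brick path is simply one of mine as an example):

# dd if=/dev/zero of=/tmp/test4.img bs=512 count=5000 oflag=dsync
# dd if=/dev/zero of=/gluster_bricks/brick_a/test4.img bs=512 count=5000 oflag=dsync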

Is there any explanation why there is such a vast difference between the
two tests?

An example of one of my mounts:

/dev/mapper/onn_orchard1-tmp /tmp ext4 defaults,discard 1 2
/dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime 0 0
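
If it helps narrow things down, the I/O scheduler and queue depth of the
device behind the brick can be checked with something like this (sda here is
just the device backing gluster_vg_sda in my case; adjust to suit):

cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/device/queue_depth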

On Sun, Mar 8, 2020 at 12:23 PM Jayme <jay...@gmail.com> wrote:

> Strahil,
>
> I'm starting to think that my problem could be related to the use of PERC
> H310 mini RAID controllers in my oVirt hosts. The OS/boot SSDs are a RAID
> mirror but the gluster storage is SSDs in passthrough. I've read that the
> queue depth of the H310 card is very low and can cause performance issues,
> especially when used with flash devices.
>
> Running dd if=/dev/zero of=test4.img bs=512 count=5000 oflag=dsync on one of
> my hosts' gluster bricks (/gluster_bricks/brick_a, for example) takes ~45
> seconds to complete.
>
> I can perform the same operation in ~2 seconds on another server with a
> better RAID controller, but with the same model SSD.
>
> I might look at swapping out the H310s. Unfortunately, I think that may
> require me to wipe the gluster storage drives, since with another controller
> I believe they'd each need to be added as single RAID 0 arrays and would
> have to be rebuilt to do so.
>
> If I were to take one host down at a time, is there a way I can rebuild the
> entire server, including wiping the gluster disks, and then add the host
> back into the oVirt cluster and rebuild it along with the bricks? How would
> you recommend doing such a task if I needed to wipe the gluster disks on
> each host?
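>
> I'm guessing something like gluster's reset-brick might be part of it once
> a disk is wiped and recreated, e.g. (volume, host and brick names below are
> only placeholders, and I haven't verified this is the right procedure for
> HCI):
>
> gluster volume reset-brick <volname> host1:/gluster_bricks/brick_a start
> # wipe the disk, recreate the LV/XFS filesystem and remount the brick
> gluster volume reset-brick <volname> host1:/gluster_bricks/brick_a host1:/gluster_bricks/brick_a commit force
> gluster volume heal <volname> full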
>
>
>
> On Sat, Mar 7, 2020 at 6:24 PM Jayme <jay...@gmail.com> wrote:
>
>> No worries at all about the length of the email, the details are highly
>> appreciated. You've given me lots to look into and consider.
>>
>>
>>
>> On Sat, Mar 7, 2020 at 10:02 AM Strahil Nikolov <hunter86...@yahoo.com>
>> wrote:
>>
>>> On March 7, 2020 1:12:58 PM GMT+02:00, Jayme <jay...@gmail.com> wrote:
>>> >Thanks again for the info. You're probably right about the testing
>>> >method. Though the reason I'm down this path in the first place is that
>>> >I'm seeing a problem in real-world workloads. Many of my VMs are used in
>>> >development environments where working with small files is common, such
>>> >as npm installs with large node_modules folders and CI/CD doing lots of
>>> >mixed I/O and compute operations.
>>> >
>>> >I started testing some of these things by comparing side by side with a
>>> >VM using the same specs, the only difference being gluster vs. NFS
>>> >storage. NFS-backed storage is performing about 3x better in real-world
>>> >use.
>>> >
>>> >The Gluster version is the stock one that comes with 4.3.7. I haven't
>>> >attempted updating it outside of official oVirt updates.
>>> >
>>> >I'd like to see if I could improve it to handle my workloads better. I
>>> >also understand that replication adds overhead.
>>> >
>>> >I do wonder how much difference in performance there would be with
>>> >replica 3 vs replica 3 arbiter 1. I'd assume the arbiter setup would be
>>> >faster, but perhaps not by a considerable margin.
>>> >
>>> >I will check into C-states as well.
>>> >
>>> >On Sat, Mar 7, 2020 at 2:52 AM Strahil Nikolov <hunter86...@yahoo.com>
>>> >wrote:
>>> >
>>> >> On March 7, 2020 1:09:37 AM GMT+02:00, Jayme <jay...@gmail.com>
>>> >wrote:
>>> >> >Strahil,
>>> >> >
>>> >> >Thanks for your suggestions. The config is a pretty standard HCI
>>> >> >setup with cockpit, and the hosts are oVirt Node. XFS was handled by
>>> >> >the deployment automatically. The gluster volumes were optimized for
>>> >> >virt store.
>>> >> >
>>> >> >I tried noop on the SSDs; that made zero difference in the tests I
>>> >> >was running above. I took a look at the random-io-profile and it
>>> >> >looks like it really only sets vm.dirty_background_ratio = 2 and
>>> >> >vm.dirty_ratio = 5 -- my hosts already appear to have those sysctl
>>> >> >values, and by default are using the virtual-host tuned profile.
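>>> >> >
>>> >> >For reference, I'm checking those with:
>>> >> >
>>> >> ># sysctl vm.dirty_background_ratio vm.dirty_ratio
>>> >> ># tuned-adm active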
>>> >> >
>>> >> >I'm curious what a test like "dd if=/dev/zero of=test2.img bs=512
>>> >> >count=1000 oflag=dsync" on one of your VMs would show for results?
>>> >> >
>>> >> >I haven't done much with gluster profiling but will take a look and
>>> >> >see if I can make sense of it. Otherwise, the setup is a pretty stock
>>> >> >oVirt HCI deployment with SSD-backed storage and a 10GbE storage
>>> >> >network. I'm not coming anywhere close to maxing out network
>>> >> >throughput.
>>> >> >
>>> >> >The NFS export I was testing was an export from a local server
>>> >> >exporting a single SSD (same type as in the oVirt hosts).
>>> >> >
>>> >> >I might end up switching storage to NFS and ditching gluster if
>>> >> >performance
>>> >> >is really this much better...
>>> >> >
>>> >> >
>>> >> >On Fri, Mar 6, 2020 at 5:06 PM Strahil Nikolov
>>> ><hunter86...@yahoo.com>
>>> >> >wrote:
>>> >> >
>>> >> >> On March 6, 2020 6:02:03 PM GMT+02:00, Jayme <jay...@gmail.com>
>>> >> >wrote:
>>> >> >> >I have a 3-server HCI setup with Gluster replica 3 storage (10GbE
>>> >> >> >and SSD disks). Small file performance inside the VMs is pretty
>>> >> >> >terrible compared to a similarly spec'ed VM using an NFS mount
>>> >> >> >(10GbE network, SSD disk).
>>> >> >> >
>>> >> >> >VM with gluster storage:
>>> >> >> >
>>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> >> >1000+0 records in
>>> >> >> >1000+0 records out
>>> >> >> >512000 bytes (512 kB) copied, 53.9616 s, 9.5 kB/s
>>> >> >> >
>>> >> >> >VM with NFS:
>>> >> >> >
>>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> >> >1000+0 records in
>>> >> >> >1000+0 records out
>>> >> >> >512000 bytes (512 kB) copied, 2.20059 s, 233 kB/s
>>> >> >> >
>>> >> >> >This is a very big difference: 2 seconds for 1000 synchronous
>>> >> >> >512-byte writes on the NFS VM vs. 53 seconds on the other.
>>> >> >> >
>>> >> >> >Aside from enabling libgfapi, is there anything I can tune on the
>>> >> >> >gluster or VM side to improve small file performance? I have seen
>>> >> >> >some guides by Red Hat regarding small file performance, but I'm
>>> >> >> >not sure what (if any) of it applies to oVirt's implementation of
>>> >> >> >gluster in HCI.
>>> >> >>
>>> >> >> You can use the rhgs-random-io tuned profile from
>>> >> >> ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.4.2.0-1.el7rhgs.src.rpm
>>> >> >> and try with that on your hosts.
>>> >> >> In my case, I have modified it so it's a mixture between
>>> >> >> rhgs-random-io and the profile for Virtualization Host.
>>> >> >>
>>> >> >> Also, ensure that your bricks are using XFS with the
>>> >> >> relatime/noatime mount option and that your scheduler for the SSDs
>>> >> >> is either 'noop' or 'none'. The default I/O scheduler for RHEL7 is
>>> >> >> deadline, which gives preference to reads, and your workload is
>>> >> >> definitely 'write'.
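>>> >> >>
>>> >> >> For example, something like this per SSD (sdX is a placeholder, and
>>> >> >> it won't persist across reboots without a udev rule or tuned
>>> >> >> profile):
>>> >> >>
>>> >> >> echo noop > /sys/block/sdX/queue/scheduler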
>>> >> >>
>>> >> >> Ensure that the virt settings are  enabled for your gluster
>>> >volumes:
>>> >> >> 'gluster volume set <volname> group virt'
>>> >> >>
>>> >> >> Also, are you running on fully allocated disks for the VMs, or did
>>> >> >> you start thin?
>>> >> >> I'm asking because creation of new shards at the gluster level is a
>>> >> >> slow task.
>>> >> >>
>>> >> >> Have you checked gluster profiling for the volume? It can clarify
>>> >> >> what is going on.
>>> >> >>
>>> >> >>
>>> >> >> Also, are you comparing apples to apples?
>>> >> >> For example, 1 SSD mounted and exported as NFS versus a replica 3
>>> >> >> volume on the same type of SSD? If not, the NFS side can have more
>>> >> >> IOPS due to multiple disks behind it, while Gluster has to write the
>>> >> >> same thing on all nodes.
>>> >> >>
>>> >> >> Best Regards,
>>> >> >> Strahil Nikolov
>>> >> >>
>>> >> >>
>>> >>
>>> >> Hi Jayme,
>>> >>
>>> >>
>>> >> My tests are not quite comparable, as I have a different setup:
>>> >>
>>> >> NVMe - VDO - 4 thin LVs - XFS - 4 Gluster volumes (replica 2 arbiter 1)
>>> >> - 4 storage domains - striped LV in each VM
>>> >>
>>> >> RHEL7 VM (fully stock):
>>> >> [root@node1 ~]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> 1000+0 records in
>>> >> 1000+0 records out
>>> >> 512000 bytes (512 kB) copied, 19.8195 s, 25.8 kB/s
>>> >> [root@node1 ~]#
>>> >>
>>> >> Brick:
>>> >> [root@ovirt1 data_fast]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> 1000+0 records in
>>> >> 1000+0 records out
>>> >> 512000 bytes (512 kB) copied, 1.41192 s, 363 kB/s
>>> >>
>>> >> As I use VDO with compression (on 1/4 of the NVMe), I cannot expect
>>> >> any performance from it.
>>> >>
>>> >>
>>> >> Is your app really using dsync? I have seen many times that
>>> >> performance testing with the wrong tools/tests causes more trouble
>>> >> than it should.
>>> >>
>>> >> I would recommend testing with a real workload before deciding to
>>> >> change the architecture.
>>> >>
>>> >> I forgot to mention that you need to disable C-states on your systems
>>> >> if you are chasing performance.
>>> >> Run a gluster profile while you run a real workload in your VMs and
>>> >> then provide that for analysis.
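>>> >>
>>> >> For example:
>>> >>
>>> >> gluster volume profile <volname> start
>>> >> # ... run the workload in the VM ...
>>> >> gluster volume profile <volname> info
>>> >> gluster volume profile <volname> stop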
>>> >>
>>> >> Which version of Gluster are you using?
>>> >>
>>> >> Best Regards,
>>> >> Strahil Nikolov
>>> >>
>>>
>>> Hm...
>>> Then you do have a real workload scenario - pick one of the most often
>>> used tasks and use its time of completion for reference.
>>> Synthetic benchmarking is not good.
>>>
>>> As far as I know, oVirt is actually running on gluster v6.x.
>>> @Sandro,
>>> Can you tell us the highest supported gluster version on oVirt? I'm
>>> running v7.0, so I'm a little bit off the track.
>>>
>>> Jayme,
>>>
>>> Next steps are to check:
>>> 1. Did you disable C-states? There are very good articles for
>>> RHEL/CentOS 7 (see the example after this list).
>>> 2. Check the firmware of your HCI nodes - I've seen numerous network/SAN
>>> issues due to old firmware, including stuck processes.
>>> 3. Check the articles for RHV and hugepages. If your VMs have dynamic
>>> memory and need lots of RAM, hugepages will bring more performance.
>>> Second, transparent huge pages must be disabled.
>>> 4. Create a High Performance VM for testing purposes with fully
>>> allocated disks.
>>> 5. Check if 'noatime' or 'relatime' is set for the bricks. If SELinux is
>>> in enforcing mode (which I highly recommend), you can use the mount
>>> option context="system_u:object_r:glusterd_brick_t:s0", which lets the
>>> kernel skip the SELinux context lookups for all files in the brick -
>>> increasing performance.
>>>
>>> 6. Consider switching to 'noop'/'none' or tuning 'deadline' I/O
>>> scheduler to match your needs
>>>
>>> 7. Capture a gluster profile while the VM from step 4 is being tested, if
>>> needed.
>>>
>>> 8. Consider using 'Pass-through host CPU', which is enabled in the UI via
>>> VM -> Edit -> Host -> Start on specific host -> select all hosts with the
>>> same CPU -> allow manual and automatic migration -> OK.
>>> This mode makes all instructions of the host CPU available to the guest,
>>> greatly increasing performance for a lot of software.
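>>>
>>> For item 1, one common way to limit C-states on RHEL/CentOS 7 is via the
>>> kernel command line, e.g. (reboot required, double-check against the
>>> articles first, and the grub.cfg path differs on UEFI systems):
>>>
>>> # append to GRUB_CMDLINE_LINUX in /etc/default/grub:
>>> #   processor.max_cstate=1 intel_idle.max_cstate=0
>>> grub2-mkconfig -o /boot/grub2/grub.cfg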
>>>
>>>
>>> The difference between 'replica 3' and 'replica 3 arbiter 1' (the old
>>> name was 'replica 2 arbiter 1', but it means the same) is that the
>>> arbitrated volume requires less bandwidth (due to the fact that the files
>>> on the arbiter have 0 bytes of data) and stores only metadata to prevent
>>> split-brain.
>>> The drawback of the arbiter is that you have only 2 sources to read from,
>>> while replica 3 provides three sources to read from.
>>> With glusterd 2.0 (I think it was introduced in gluster v7) the arbiter
>>> doesn't need to be local (which means higher latencies are no longer an
>>> issue) and is only consulted when one of the data bricks is down. Still,
>>> the remote arbiter is too new for prod.
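>>>
>>> (For reference, such a volume is created with something like the
>>> following, where the last brick listed is the arbiter - hosts and paths
>>> are just placeholders:)
>>>
>>> gluster volume create <volname> replica 3 arbiter 1 host1:/brick host2:/brick host3:/arbiter-brick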
>>>
>>> Next: You can consider a clustered 2-node NFS Ganesha setup (with a
>>> quorum device for the third vote) as an NFS source. The good thing about
>>> NFS Ganesha is that it is the primary focus of the Gluster community and
>>> it uses libgfapi to connect to the backend (replica volume).
>>>
>>> I think that's enough for now, but I guess other stuff could come to my
>>> mind at a later stage.
>>>
>>> Edit: This e-mail is way longer than I initially intended it to be. Sorry
>>> about that.
>>>
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/52IQWC5X7XZQLDHRAKVO4OINJSES75LE/
