On Fri, Mar 17, 2023 at 1:56 AM Ashu Pachauri <ashu210...@gmail.com> wrote:
>
> Hi Xiubo,
>
> As you have correctly pointed out, I was talking about the stripe_unit
> setting in the file layout configuration. Here is the documentation for
> that for anyone else's reference:
> https://docs.ceph.com/en/quincy/cephfs/file-layouts/
>
> As with any RAID0 setup, the stripe_unit is definitely workload dependent.
> Our use case requires us to read anywhere from a few kilobytes to a few
> hundred kilobytes at once. Having a 4MB default stripe_unit definitely
> hurts quite a bit. We were able to achieve almost 2x improvement in terms
> of average latency and overall throughput (for useful data) by reducing the
> stripe_unit. The rule of thumb is that you want to align the stripe_unit to
> your most common IO size.
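>
> For anyone who wants to try this: the layout can be set per directory
> (newly created files inherit it) via setfattr. A minimal sketch, assuming
> an example directory under a mount at /mnt/cephfs; note that stripe_unit
> must be a multiple of 64KB and must divide object_size evenly:
>
>   # new files created under this directory inherit the smaller stripe_unit
>   setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/mydir
>   setfattr -n ceph.dir.layout.stripe_count -v 1 /mnt/cephfs/mydir
>   setfattr -n ceph.dir.layout.object_size -v 4194304 /mnt/cephfs/mydir
>
> Existing files keep the layout they were created with; only new files
> pick up the change.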

There's a lot more that goes into choosing a stripe_unit for CephFS than
IO size. This may improve your workload, but it generally just means your
IO accesses go out to more objects (so more PGs and more OSDs). That can
be good if you have low client counts doing random IO, but it means that
readahead (when not broken, as it appears to be in your kernel! :/) is
much, much less effective and more expensive.
-Greg

>
> > BTW, have you tried to set the 'rasize' option to a small size instead
> > of 0? Won't this work?
>
> No, this won't work; I have tried it already. Since rasize only affects
> readahead, the minimum IO size issued to the cephfs client will still be
> max(rasize, stripe_unit). rasize is a useful configuration only when it
> needs to be larger than the stripe_unit; otherwise it's not. Also, it's
> worth pointing out that simply setting rasize is not sufficient; one
> also needs to change the corresponding configurations that control the
> maximum/minimum readahead for Ceph clients.
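>
> For reference, a sketch of both knobs side by side (the monitor address,
> secret and values are just examples): the kernel client caps readahead
> via the rasize mount option, while the client_readahead_* options govern
> the userspace (fuse/libcephfs) clients:
>
>   # kernel client: cap the readahead window (bytes) at mount time
>   mount -t ceph mon1:6789:/ /mnt/cephfs -o name=cephfs,secret=<key>,rasize=0
>
>   # userspace clients: tune readahead through the client config
>   ceph config set client client_readahead_max_bytes 4096
>   ceph config set client client_readahead_max_periods 1
>   ceph config set client client_readahead_min 0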
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiu...@redhat.com> wrote:
>
> >
> > On 15/03/2023 17:20, Frank Schilder wrote:
> > > Hi Ashu,
> > >
> > > are you talking about the kernel client? I can't find "stripe size"
> > > anywhere in its mount documentation. Could you possibly post exactly
> > > what you did? Mount fstab line, config setting?
> >
> > There is no mount option for this in either the userspace or kernel
> > client. You need to change the file layout instead, which by default
> > is (4MB stripe_unit, 1 stripe_count and 4MB object_size).
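> >
> > To check which layout is actually in effect, getfattr shows it in one
> > shot (example paths; the dir variant errors out if no explicit layout
> > was set on that directory):
> >
> >   getfattr -n ceph.file.layout /mnt/cephfs/somefile
> >   getfattr -n ceph.dir.layout /mnt/cephfs/somedir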
> >
> > Certainly a smaller stripe_unit will work. But IMO it depends on the
> > workload, so be careful: changing the layout may cause other
> > performance issues in some cases. For example, too small a stripe_unit
> > may split a sync read into more OSD requests to different OSDs.
> >
> > I will generate a patch to make the kernel client wiser instead of
> > always blindly setting the read size to the stripe_unit.
> >
> > Thanks
> >
> > - Xiubo
> >
> >
> > >
> > > Thanks!
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Ashu Pachauri <ashu210...@gmail.com>
> > > Sent: 14 March 2023 19:23:42
> > > To: ceph-users@ceph.io
> > > Subject: [ceph-users] Re: CephFS thrashing through the page cache
> > >
> > > Got the answer to my own question; posting here if someone else
> > > encounters the same problem. The issue is that the default stripe
> > > size in a cephfs mount is 4 MB. If you are doing small reads (like
> > > the 4k reads in the test I posted) inside the file, you'll end up
> > > pulling at least 4MB to the client (and then discarding most of the
> > > pulled data) even if you set readahead to zero. So, the solution for
> > > us was to set a lower stripe size, which aligns better with our
> > > workloads.
> > >
> > > Thanks and Regards,
> > > Ashu Pachauri
> > >
> > >
> > > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210...@gmail.com> wrote:
> > >
> > >> Also, I am able to reproduce the network read amplification when I
> > >> try to do very small reads from larger files, e.g.:
> > >>
> > >> for i in $(seq 1 10000); do
> > >>    dd if=test_${i} of=/dev/null bs=5k count=10
> > >> done
> > >>
> > >>
> > >> This piece of code generates 3.3 GB of network traffic while it
> > >> actually reads approx. 500 MB of data.
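> > >>
> > >> For anyone reproducing this, one way to measure the client-side
> > >> traffic is to sample /proc/net/dev around the read loop (a sketch;
> > >> eth0 is just an example interface name):
> > >>
> > >>   rx1=$(awk '$1 == "eth0:" {print $2}' /proc/net/dev)
> > >>   # ... run the read loop above ...
> > >>   rx2=$(awk '$1 == "eth0:" {print $2}' /proc/net/dev)
> > >>   echo "received: $((rx2 - rx1)) bytes"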
> > >>
> > >>
> > >> Thanks and Regards,
> > >> Ashu Pachauri
> > >>
> > >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210...@gmail.com>
> > >> wrote:
> > >>
> > >>> We have an internal use case where we back the storage of a
> > >>> proprietary database with a shared file system. We noticed something
> > >>> very odd when testing some workload on a local block device backed
> > >>> file system vs cephfs: the amount of network IO done by cephfs is
> > >>> almost double the IO done in the case of a local file system backed
> > >>> by an attached block device.
> > >>>
> > >>> We also noticed that CephFS thrashes through the page cache very
> > >>> quickly compared to the amount of data being read, and we think the
> > >>> two issues might be related. So, I wrote a simple test:
> > >>>
> > >>> 1. I wrote 10k files, 400KB each, using dd (approx 4 GB of data).
> > >>> 2. I dropped the page cache completely.
> > >>> 3. I then read these files serially, again using dd. The page cache
> > >>>    usage shot up to 39 GB while reading such a small amount of data.
> > >>>
> > >>> Following is the code used to repro this in bash:
> > >>>
> > >>> for i in $(seq 1 10000); do
> > >>>    dd if=/dev/zero of=test_${i} bs=4k count=100
> > >>> done
> > >>>
> > >>> sync; echo 1 > /proc/sys/vm/drop_caches
> > >>>
> > >>> for i in $(seq 1 10000); do
> > >>>    dd if=test_${i} of=/dev/null bs=4k count=100
> > >>> done
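> > >>>
> > >>> To watch the page cache grow while the read loop runs, polling
> > >>> /proc/meminfo from another shell is enough (a sketch):
> > >>>
> > >>>   while true; do grep ^Cached: /proc/meminfo; sleep 1; done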
> > >>>
> > >>>
> > >>> The ceph version being used is:
> > >>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
> > >>> (stable)
> > >>>
> > >>> The ceph configs being overridden:
> > >>> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
> > >>> mon           advanced  auth_allow_insecure_global_id_reclaim   false
> > >>> mgr           advanced  mgr/balancer/mode                       upmap
> > >>> mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
> > >>> mgr           advanced  mgr/dashboard/server_port               8443         *
> > >>> mgr           advanced  mgr/dashboard/ssl                       false        *
> > >>> mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
> > >>> mgr           advanced  mgr/prometheus/server_port              9283         *
> > >>> osd           advanced  bluestore_compression_algorithm         lz4
> > >>> osd           advanced  bluestore_compression_mode              aggressive
> > >>> osd           advanced  bluestore_throttle_bytes                536870912
> > >>> osd           advanced  osd_max_backfills                       3
> > >>> osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
> > >>> osd           advanced  osd_scrub_auto_repair                   true
> > >>> mds           advanced  client_oc                               false
> > >>> mds           advanced  client_readahead_max_bytes              4096
> > >>> mds           advanced  client_readahead_max_periods            1
> > >>> mds           advanced  client_readahead_min                    0
> > >>> mds           basic     mds_cache_memory_limit                  21474836480
> > >>> client        advanced  client_oc                               false
> > >>> client        advanced  client_readahead_max_bytes              4096
> > >>> client        advanced  client_readahead_max_periods            1
> > >>> client        advanced  client_readahead_min                    0
> > >>> client        advanced  fuse_disable_pagecache                  false
> > >>>
> > >>> The cephfs mount options (note that readahead was disabled for
> > >>> this test):
> > >>> /mnt/cephfs type ceph
> > >>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
> > >>>
> > >>> Any help or pointers are appreciated; this is a major performance issue
> > >>> for us.
> > >>>
> > >>>
> > >>> Thanks and Regards,
> > >>> Ashu Pachauri
> > >>>
> > >
> > --
> > Best Regards,
> >
> > Xiubo Li (李秀波)
> >
> > Email: xiu...@redhat.com/xiu...@ibm.com
> > Slack: @Xiubo Li
> >
> >
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
