Hi Ashu,

thanks for the clarification. That's not an option that is easy to change. I hope that the modifications to the fs clients Xiubo has in mind will improve that. Thanks for flagging this performance issue. It would be great if this became part of a test suite.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Ashu Pachauri <ashu210...@gmail.com>
Sent: 17 March 2023 09:55:25
To: Xiubo Li
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache

Hi Xiubo,

As you have correctly pointed out, I was talking about the stripe_unit setting in the file layout configuration. Here is the documentation for that for anyone else's reference:
https://docs.ceph.com/en/quincy/cephfs/file-layouts/

As with any RAID0 setup, the stripe_unit is definitely workload dependent. Our use case requires us to read anywhere from a few kilobytes to a few hundred kilobytes at once. Having a 4MB default stripe_unit definitely hurts quite a bit. We were able to achieve an almost 2x improvement in average latency and overall throughput (for useful data) by reducing the stripe_unit. The rule of thumb is that you want to align the stripe_unit to your most common IO size.

> BTW, have you tried to set 'rasize' option to a small size instead of 0
> ? Won't this work ?

No, this won't work. I have tried it already. Since rasize only affects readahead, your minimum IO size to the cephfs client will still be the maximum of (rasize, stripe_unit). rasize is a useful setting only if it needs to be larger than the stripe_unit. Also, it's worth pointing out that simply setting rasize is not sufficient; one also needs to change the corresponding configurations that control maximum/minimum readahead for ceph clients.

Thanks and Regards,
Ashu Pachauri

On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiu...@redhat.com> wrote:

On 15/03/2023 17:20, Frank Schilder wrote:
> Hi Ashu,
>
> are you talking about the kernel client? I can't find "stripe size" anywhere
> in its mount-documentation. Could you possibly post exactly what you did?
> Mount fstab line, config setting?

There is no mount option to do this in either the userspace or the kernel client. You need to change the file layout instead, which by default is (4MB stripe_unit, 1 stripe_count and 4MB object_size).

A smaller stripe_unit will certainly work. But IMO it depends, so be careful: changing the layout may cause other performance issues in some cases. For example, too small a stripe_unit may split a sync read into more OSD requests to different OSDs.

I will generate a patch to make the kernel client wiser instead of blindly setting it to stripe_unit always.

Thanks

- Xiubo

>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ashu Pachauri <ashu210...@gmail.com>
> Sent: 14 March 2023 19:23:42
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: CephFS thrashing through the page cache
>
> Got the answer to my own question; posting here if someone else
> encounters the same problem. The issue is that the default stripe size in a
> cephfs mount is 4 MB. If you are doing small reads (like 4k reads in the
> test I posted) inside the file, you'll end up pulling at least 4MB to the
> client (and then discarding most of the pulled data) even if you set
> readahead to zero. So, the solution for us was to set a lower stripe size,
> which aligns better with our workloads.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri
> <ashu210...@gmail.com> wrote:
>
>> Also, I am able to reproduce the network read amplification when I try to
>> do very small reads from larger files. e.g.
>>
>> for i in $(seq 1 10000); do
>>   dd if=test_${i} of=/dev/null bs=5k count=10
>> done
>>
>>
>> This piece of code generates a network traffic of 3.3 GB while it actually
>> reads approx 500 MB of data.
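
As a rough check on the figures in the quoted message above, the sketch below works out how many bytes the dd loop requests and what amplification factor the reported traffic implies. It assumes 10000 files, bs=5k read as 5120 bytes, count=10, and the "3.3 GB" figure taken as 3.3 x 10^9 bytes; the function names are made up for illustration.

```shell
#!/bin/sh
# Bytes requested by the loop: files * block size * blocks per file.
requested_bytes() {
    files=$1; bs=$2; count=$3
    echo $(( files * bs * count ))
}

# Amplification = bytes on the wire / bytes requested, to one decimal place.
amplification() {
    awk -v net="$1" -v req="$2" 'BEGIN { printf "%.1f\n", net / req }'
}

# 10000 files, dd bs=5k count=10 (5k = 5120 bytes).
req=$(requested_bytes 10000 5120 10)
echo "requested: ${req} bytes"

# 3.3 GB of network traffic, assumed here to mean 3.3 * 10^9 bytes.
echo "amplification: $(amplification 3300000000 "$req")x"
```

Under these assumptions the loop asks for about 512 MB, so the reported traffic is roughly 6.4 times the data actually requested.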
>>
>>
>> Thanks and Regards,
>> Ashu Pachauri
>>
>> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri
>> <ashu210...@gmail.com>
>> wrote:
>>
>>> We have an internal use case where we back the storage of a proprietary
>>> database by a shared file system. We noticed something very odd when
>>> testing some workload with a local block device backed file system vs
>>> cephfs. We noticed that the amount of network IO done by cephfs is almost
>>> double compared to the IO done in case of a local file system backed by an
>>> attached block device.
>>>
>>> We also noticed that CephFS thrashes through the page cache very quickly
>>> compared to the amount of data being read and think that the two issues
>>> might be related. So, I wrote a simple test.
>>>
>>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
>>> 2. I dropped the page cache completely.
>>> 3. I then read these files serially, again using dd. The page cache usage
>>> shot up to 39 GB for reading such a small amount of data.
>>>
>>> Following is the code used to repro this in bash:
>>>
>>> for i in $(seq 1 10000); do
>>>   dd if=/dev/zero of=test_${i} bs=4k count=100
>>> done
>>>
>>> sync; echo 1 > /proc/sys/vm/drop_caches
>>>
>>> for i in $(seq 1 10000); do
>>>   dd if=test_${i} of=/dev/null bs=4k count=100
>>> done
>>>
>>>
>>> The ceph version being used is:
>>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
>>> (stable)
>>>
>>> The ceph configs being overridden:
>>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
>>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
>>> mgr           advanced  mgr/balancer/mode                      upmap
>>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
>>> mgr           advanced  mgr/dashboard/server_port              8443         *
>>> mgr           advanced  mgr/dashboard/ssl                      false        *
>>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
>>> mgr           advanced  mgr/prometheus/server_port             9283         *
>>> osd           advanced  bluestore_compression_algorithm        lz4
>>> osd           advanced  bluestore_compression_mode             aggressive
>>> osd           advanced  bluestore_throttle_bytes               536870912
>>> osd           advanced  osd_max_backfills                      3
>>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
>>> osd           advanced  osd_scrub_auto_repair                  true
>>> mds           advanced  client_oc                              false
>>> mds           advanced  client_readahead_max_bytes             4096
>>> mds           advanced  client_readahead_max_periods           1
>>> mds           advanced  client_readahead_min                   0
>>> mds           basic     mds_cache_memory_limit                 21474836480
>>> client        advanced  client_oc                              false
>>> client        advanced  client_readahead_max_bytes             4096
>>> client        advanced  client_readahead_max_periods           1
>>> client        advanced  client_readahead_min                   0
>>> client        advanced  fuse_disable_pagecache                 false
>>>
>>>
>>> The cephfs mount options (note that readahead was disabled for this test):
>>> /mnt/cephfs type ceph
>>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
>>>
>>> Any help or pointers are appreciated; this is a major performance issue
>>> for us.
>>>
>>>
>>> Thanks and Regards,
>>> Ashu Pachauri
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
-- 
Best Regards,

Xiubo Li (李秀波)

Email: xiu...@redhat.com/xiu...@ibm.com
Slack: @Xiubo Li
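
For anyone finding this thread later: the layout change discussed above is made through extended attributes on a file or directory, not through a mount option, as described on the file-layouts documentation page linked in the thread. The following is only a sketch under stated assumptions. The helper names, the path /mnt/cephfs/db and the 64 KiB value are made up for illustration, and the checks (stripe_unit a multiple of 64 KiB, object_size divisible by stripe_unit) reflect my understanding of the layout constraints, so please verify them against the documentation for your release. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: check a candidate stripe_unit against the assumed
# constraints (positive, multiple of 64 KiB, divides object_size evenly).
valid_stripe_unit() {
    su=$1; object_size=$2
    [ "$su" -gt 0 ] &&
        [ $(( su % 65536 )) -eq 0 ] &&
        [ $(( object_size % su )) -eq 0 ]
}

# Print (do not run) the commands that would set and then show the layout
# of a directory. ceph.dir.layout.stripe_unit is the documented xattr name.
print_layout_cmds() {
    dir=$1; su=$2; object_size=$3
    if ! valid_stripe_unit "$su" "$object_size"; then
        echo "invalid stripe_unit: $su" >&2
        return 1
    fi
    echo "setfattr -n ceph.dir.layout.stripe_unit -v $su $dir"
    echo "getfattr -n ceph.dir.layout $dir"
}

# Example values only: a made-up directory, 64 KiB stripe_unit and the
# default 4 MiB object_size mentioned in the thread.
print_layout_cmds /mnt/cephfs/db 65536 4194304
```

Note that a directory layout is inherited by files created after it is set. Existing files keep the layout they were created with, so existing data would have to be rewritten, for example by copying it into the directory, before it benefits from a smaller stripe_unit.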