[ceph-users] Re: CephFS thrashing through the page cache

Frank Schilder Fri, 17 Mar 2023 03:05:28 -0700

Hi Ashu,

thanks for the clarification. That's not an option that is easy to change. I 
hope that the modifications to the fs clients Xiubo has in mind will improve 
that. Thanks for flagging this performance issue. Would be great if this 
becomes part of a test suite.


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Ashu Pachauri <ashu210...@gmail.com>
Sent: 17 March 2023 09:55:25
To: Xiubo Li
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache

Hi Xiubo,

As you have correctly pointed out, I was talking about the stripe_unit setting 
in the file layout configuration. Here is the documentation for anyone 
else's reference: https://docs.ceph.com/en/quincy/cephfs/file-layouts/

As with any RAID0 setup, the right stripe_unit is workload dependent. Our 
use case requires us to read anywhere from a few kilobytes to a few hundred 
kilobytes at once, so the 4MB default stripe_unit hurts quite a bit. We were 
able to achieve an almost 2x improvement in average latency and overall 
throughput (for useful data) by reducing the stripe_unit. The rule of thumb 
is to align the stripe_unit to your most common IO size.
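
For anyone following along, here is a minimal sketch of how a directory's 
stripe_unit could be changed through the layout xattrs described in the 
documentation linked above. The helper name set_dir_stripe_unit, the DRY_RUN 
switch, the path /mnt/cephfs/db and the 64 KB value are illustrative 
assumptions, not the exact commands used in this thread. Per that 
documentation, files inherit the directory layout at creation time, so 
existing files keep their old layout.

```shell
# Hypothetical helper: set stripe_unit on a CephFS directory and show the
# resulting layout. With DRY_RUN=1 the commands are only printed.
set_dir_stripe_unit() {
    dir=$1
    stripe_unit=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "setfattr -n ceph.dir.layout.stripe_unit -v $stripe_unit $dir"
        echo "getfattr -n ceph.dir.layout $dir"
    else
        # New files created under $dir will inherit this stripe_unit.
        setfattr -n ceph.dir.layout.stripe_unit -v "$stripe_unit" "$dir"
        # Print the full layout so the change can be verified.
        getfattr -n ceph.dir.layout "$dir"
    fi
}

# Dry run with an assumed path and an assumed 64 KB stripe_unit.
DRY_RUN=1 set_dir_stripe_unit /mnt/cephfs/db 65536
```

Running it without DRY_RUN on a real mount needs the setfattr/getfattr tools 
(attr package) and write access to the directory.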

> BTW, have you tried to set 'rasize' option to a small size instead of 0
> ? Won't this work ?

No, this won't work; I have already tried it. Since rasize only affects 
readahead, the minimum IO size issued by the cephfs client will still be the 
maximum of (rasize, stripe_unit). rasize is therefore a useful setting only 
when it needs to be larger than the stripe_unit. It's also worth pointing out 
that setting rasize alone is not sufficient; one also needs to change the 
corresponding configuration options that control maximum/minimum readahead 
for ceph clients.
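
The relationship described above can be written down as a small sketch. It 
only models the statement in this message (minimum fetch = the larger of 
rasize and stripe_unit) and is not the client's actual code; the function 
name min_fetch_bytes is made up for illustration.

```shell
# min_fetch_bytes RASIZE STRIPE_UNIT
# Prints the larger of the two values (both in bytes).
min_fetch_bytes() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

min_fetch_bytes 0 4194304        # rasize=0, default 4MB stripe_unit -> 4194304
min_fetch_bytes 8388608 4194304  # rasize larger than stripe_unit    -> 8388608
min_fetch_bytes 0 65536          # rasize=0, assumed 64KB stripe_unit -> 65536
```

The first call shows why rasize=0 alone does not help: the result is still 
the full stripe_unit.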

Thanks and Regards,
Ashu Pachauri


On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiu...@redhat.com> wrote:

On 15/03/2023 17:20, Frank Schilder wrote:
> Hi Ashu,
>
> are you talking about the kernel client? I can't find "stripe size" anywhere 
> in its mount-documentation. Could you possibly post exactly what you did? 
> Mount fstab line, config setting?

There is no mount option to do this in either the userspace or the kernel
client. Instead, you need to change the file layout, which by default is
4MB stripe_unit, stripe_count 1 and 4MB object_size.

A smaller stripe_unit will certainly work. But IMO it depends on the
workload, so be careful: changing the layout may cause other performance
issues in some cases. For example, a stripe_unit that is too small may
split a sync read into more OSD requests to different OSDs.
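
As a rough illustration of that trade-off, the sketch below counts how many
stripe units a single read spans. The function name stripe_units_spanned and
the 64KB value are assumptions for the example. Whether each unit really
becomes a separate OSD request depends on stripe_count, object_size and the
client, so treat the count as an upper bound on the splitting.

```shell
# stripe_units_spanned OFFSET LENGTH STRIPE_UNIT   (bytes, LENGTH > 0)
# Prints the number of stripe units touched by the byte range.
stripe_units_spanned() {
    first=$(( $1 / $3 ))              # index of the unit holding the first byte
    last=$(( ($1 + $2 - 1) / $3 ))    # index of the unit holding the last byte
    echo $(( last - first + 1 ))
}

stripe_units_spanned 0 409600 4194304   # whole 400KB file, 4MB units  -> 1
stripe_units_spanned 0 409600 65536     # whole 400KB file, 64KB units -> 7
stripe_units_spanned 61440 8192 65536   # 8KB read crossing a boundary -> 2
```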

I will prepare a patch to make the kernel client smarter about this instead
of always blindly using the stripe_unit.

Thanks

- Xiubo


>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ashu Pachauri <ashu210...@gmail.com>
> Sent: 14 March 2023 19:23:42
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: CephFS thrashing through the page cache
>
> Got the answer to my own question; posting here if someone else
> encounters the same problem. The issue is that the default stripe size in a
> cephfs mount is 4 MB. If you are doing small reads (like 4k reads in the
> test I posted) inside the file, you'll end up pulling at least 4MB to the
> client (and then discarding most of the pulled data) even if you set
> readahead to zero. So, the solution for us was to set a lower stripe size,
> which aligns better with our workloads.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri <ashu210...@gmail.com> wrote:
>
>> Also, I am able to reproduce the network read amplification when I try to
>> do very small reads from larger files. e.g.
>>
>> for i in $(seq 1 10000); do
>>    dd if=test_${i} of=/dev/null bs=5k count=10
>> done
>>
>>
>> This piece of code generates 3.3 GB of network traffic while it actually
>> reads approx 500 MB of data.
>>
>>
>> Thanks and Regards,
>> Ashu Pachauri
>>
>> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri <ashu210...@gmail.com> wrote:
>>
>>> We have an internal use case where we back the storage of a proprietary
>>> database by a shared file system. We noticed something very odd when
>>> testing some workload with a local block device backed file system vs
>>> cephfs. We noticed that the amount of network IO done by cephfs is almost
>>> double compared to the IO done in case of a local file system backed by an
>>> attached block device.
>>>
>>> We also noticed that CephFS thrashes through the page cache very quickly
>>> compared to the amount of data being read and think that the two issues
>>> might be related. So, I wrote a simple test.
>>>
>>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
>>> 2. I dropped the page cache completely.
>>> 3. I then read these files serially, again using dd. The page cache usage
>>> shot up to 39 GB for reading such a small amount of data.
>>>
>>> Following is the code used to repro this in bash:
>>>
>>> for i in $(seq 1 10000); do
>>>    dd if=/dev/zero of=test_${i} bs=4k count=100
>>> done
>>>
>>> sync; echo 1 > /proc/sys/vm/drop_caches
>>>
>>> for i in $(seq 1 10000); do
>>>    dd if=test_${i} of=/dev/null bs=4k count=100
>>> done
>>>
>>>
>>> The ceph version being used is:
>>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
>>> (stable)
>>>
>>> The ceph configs being overridden:
>>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
>>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
>>> mgr           advanced  mgr/balancer/mode                      upmap
>>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
>>> mgr           advanced  mgr/dashboard/server_port              8443         *
>>> mgr           advanced  mgr/dashboard/ssl                      false        *
>>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
>>> mgr           advanced  mgr/prometheus/server_port             9283         *
>>> osd           advanced  bluestore_compression_algorithm        lz4
>>> osd           advanced  bluestore_compression_mode             aggressive
>>> osd           advanced  bluestore_throttle_bytes               536870912
>>> osd           advanced  osd_max_backfills                      3
>>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
>>> osd           advanced  osd_scrub_auto_repair                  true
>>> mds           advanced  client_oc                              false
>>> mds           advanced  client_readahead_max_bytes             4096
>>> mds           advanced  client_readahead_max_periods           1
>>> mds           advanced  client_readahead_min                   0
>>> mds           basic     mds_cache_memory_limit                 21474836480
>>> client        advanced  client_oc                              false
>>> client        advanced  client_readahead_max_bytes             4096
>>> client        advanced  client_readahead_max_periods           1
>>> client        advanced  client_readahead_min                   0
>>> client        advanced  fuse_disable_pagecache                 false
>>>
>>>
>>> The cephfs mount options (note that readahead was disabled for this test):
>>> /mnt/cephfs type ceph
>>> (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
>>>
>>> Any help or pointers are appreciated; this is a major performance issue
>>> for us.
>>>
>>>
>>> Thanks and Regards,
>>> Ashu Pachauri
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
--
Best Regards,

Xiubo Li (李秀波)

Email: xiu...@redhat.com/xiu...@ibm.com
Slack: @Xiubo Li

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
